docs: add legacy workflow and migration guidance

mbsantiago 2026-04-30 11:48:25 +01:00
parent 300716895e
commit a2f2a2d398
7 changed files with 677 additions and 0 deletions

docs/plan.md Normal file

@ -0,0 +1,441 @@
# Documentation Plan
## Goal
Build documentation around the main user stories:
1. Run inference with the CLI on one folder of audio.
2. Use the Python API for inference with fine-grained control over outputs,
including per-file workflows, class scores, features, and batch processing.
3. Train or fine-tune a custom model.
4. Evaluate a model and understand what the metrics mean.
5. Understand the concepts needed to use BatDetect2 correctly.
The docs should provide:
- a simple happy path in tutorials,
- richer task-oriented guidance in how-to guides,
- complete lookup material in reference,
- deep conceptual coverage in understanding.
Note: the current docs tree uses `explanation/`. For Diataxis consistency, this
plan uses `understanding/` as the target name for that conceptual section.
## Current State Review
### Looks reasonably complete
- `docs/source/index.md`: good top-level orientation and navigation.
- `docs/source/getting_started.md`: solid install and entry-point guidance.
- `docs/source/explanation/*.md`: the conceptual pages are currently the
strongest part of the docs, especially pipeline overview, thresholds,
preprocessing consistency, and targets.
- `docs/source/how_to/configure-*.md` and related target/data pages: practical
support docs for preprocessing, targets, ROI mapping, and dataset formats are
in decent shape.
- `docs/source/reference/cli/*.rst`: CLI reference wiring exists and should
render useful option-level documentation from the Click commands.
### Partially complete
- `docs/source/how_to/run-batch-predictions.md`: useful, but thin.
- `docs/source/how_to/tune-detection-threshold.md`: useful, but too brief for
a key workflow.
- `docs/source/reference/preprocessing-config.md`
- `docs/source/reference/postprocess-config.md`
- `docs/source/reference/targets-config-workflow.md`
These are good summaries, but they do not yet feel like complete references for
all the customization surfaces available in the code.
### Clearly incomplete or scaffolded
- `docs/source/tutorials/run-inference-on-folder.md`
- `docs/source/tutorials/integrate-with-a-python-pipeline.md`
- `docs/source/tutorials/train-a-custom-model.md`
- `docs/source/tutorials/evaluate-on-a-test-set.md`
All four main tutorials are still starter scaffolds. This is the biggest gap in
the current user story.
### Major mismatch to resolve
- `README.md` still tells an older story built around `batdetect2 detect` and
`batdetect2.api`.
- The docs site tells the newer story built around `batdetect2 predict` and
`batdetect2.api_v2`.
This creates avoidable confusion for users and should be treated as a priority
documentation alignment issue.
### Legacy documentation is not yet placed clearly
The repo still contains meaningful legacy documentation material, but it is not
yet presented as a clearly marked legacy path inside the docs.
Users need two things:
- a clear message that these docs exist for the previous BatDetect2 workflow,
- a clear recommendation that new users should prefer the newer CLI/API
workflows and migrate where possible.
## Legacy Documentation Plan
### Goals
1. Preserve access to the old workflow documentation.
2. Prevent new users from accidentally following legacy guidance.
3. Give current users a clear migration path from legacy to current workflows.
### Proposed location
Add a dedicated legacy area inside the docs, for example:
- `docs/source/legacy/index.md`
- `docs/source/legacy/cli-detect.md`
- `docs/source/legacy/python-api.md`
- `docs/source/legacy/feature-extraction.md`
- `docs/source/legacy/migration-guide.md`
This keeps the material available without mixing it into the main happy-path
docs.
### User-facing messaging
Add clear notices at all relevant navigation entry points.
Suggested message pattern:
"If you want to use the previous version of BatDetect2, see the legacy
documentation. For new workflows, we recommend using the current `predict`
CLI and `BatDetect2API` interfaces."
Places that should link to the legacy docs:
- `docs/source/index.md`
- `docs/source/getting_started.md`
- `README.md`
- tutorial landing pages where users may be coming from older workflows
- any page that mentions the old `detect` command or old Python API
### Migration guide plan
Add a dedicated migration guide that explains:
1. who should migrate now and who may need to stay on the legacy workflow,
2. the mapping from old CLI commands to new CLI commands,
3. the mapping from old Python API calls to new `api_v2` / `BatDetect2API`
patterns,
4. what changed in outputs, terminology, and configuration,
5. how legacy feature extraction concepts map to the new API surfaces,
6. what behavior differences users should validate before switching,
7. a short migration checklist.
High-priority migration mappings to document:
- `batdetect2 detect` -> `batdetect2 predict directory`
- old `batdetect2.api` file processing -> `BatDetect2API.from_checkpoint(...)`
plus `process_file`, `process_files`, `process_audio`, or
`process_spectrogram`
- legacy `cnn_feats`, `spec_features`, and `spec_slices` -> current output and
feature access patterns, with explicit notes where there is no direct
one-to-one replacement
### Legacy content handling plan
For each legacy page or legacy concept:
1. Decide whether it should be preserved as-is, rewritten as a legacy page, or
replaced by the migration guide.
2. Add a prominent warning banner saying it describes the previous workflow.
3. Link forward to the current equivalent page when one exists.
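The banner and forward-link requirements above could be checked automatically. The following is a minimal sketch, assuming legacy pages use the MyST `{warning}` directive and `{doc}` role seen elsewhere in these docs; the function name `check_legacy_page` and the problem messages are hypothetical, not part of BatDetect2.

```python
import re
from typing import List

# Built from parts so the literal fence never appears at the start of a line.
FENCE = "`" * 3

# A {doc} link that leaves the legacy directory, e.g. {doc}`../reference/api`.
FORWARD_LINK = re.compile(r"\{doc\}`\.\./[^`]+`")


def check_legacy_page(text: str) -> List[str]:
    """Return a list of problems found in one legacy page (hypothetical check)."""
    problems = []
    if FENCE + "{warning}" not in text:
        problems.append("missing warning banner")
    if not FORWARD_LINK.search(text):
        problems.append("missing forward link to current docs")
    return problems
```

Running a check like this over every file under `docs/source/legacy/` would give a quick list of pages that still need a banner or a link to the current equivalent.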
### Definition of done for legacy handling
Legacy documentation work is done when:
1. a reader can clearly distinguish legacy from current docs,
2. old users can still find the previous workflow documentation,
3. new users are consistently directed to the new docs,
4. there is a practical migration guide covering the main CLI and Python API
transitions.
## Main Gaps By User Story
### 1. CLI inference
Current coverage exists, but the happy path is not truly documented yet.
Missing:
- a full worked tutorial from input audio to saved outputs,
- clear guidance on what outputs are written and how to inspect them,
- stronger documentation for `predict dataset`,
- a clearer story for default model vs custom checkpoint,
- practical guidance for selecting output formats and thresholds.
### 2. Python API inference
This is currently the weakest major story.
The code exposes much more than the docs explain, including:
- `BatDetect2API.from_checkpoint` and `from_config`,
- `process_file`, `process_files`, `process_directory`, `process_clips`,
- `process_audio`, `process_spectrogram`,
- `get_top_class_name`, `get_class_scores`, `get_detection_features`,
- `save_predictions` and `load_predictions`.
Missing docs:
- an API-first tutorial with a simple path,
- a how-to for file-by-file inspection and custom post-processing,
- a how-to for batch API inference,
- a reference page for `BatDetect2API`,
- an explanation of what the feature vectors are and how users should think
about them.
Important terminology note:
- the old API/docs talk about `cnn_feats`, `spec_features`, and `spec_slices`,
- the new API exposes per-detection `features`,
- users interested in embeddings / downstream exploration will need a clear,
explicit doc that connects these ideas.
### 3. Batch inference
Batch prediction exists in both CLI and API workflows, but the docs do not yet
explain the design space well.
Missing:
- when to use `directory` vs `file_list` vs `dataset`,
- how clipping works during inference,
- what `InferenceConfig` controls,
- how batch size, workers, and output format choices affect runs,
- how to organize large runs reproducibly.
### 4. Training a custom model
Supporting pages exist, but the end-to-end story is not yet there.
Missing:
- one complete tutorial from dataset config to checkpoints and sanity check,
- a "minimum viable training setup" page,
- clearer explanation of how model, targets, audio, training, inference,
outputs, and logging configs fit together,
- a fine-tuning story versus training from scratch.
### 5. Evaluation
Evaluation is significantly under-documented relative to the code.
Missing:
- what evaluation tasks exist,
- what metrics and plots are produced,
- how predictions are matched to annotations,
- how to interpret failures and trade-offs,
- how to configure evaluation for different research questions.
### 6. Understanding / concepts
This is the best-developed section today, but it still needs expansion.
Concepts that should be covered more fully:
- what the model predicts,
- what the raw and formatted outputs represent,
- how to interpret detection scores and class scores,
- what targets are and how they shape training and decoding,
- how preprocessing choices affect model behavior,
- what the extracted features represent and when they are useful,
- what evaluation metrics actually measure,
- why local validation is required before ecological inference.
## Proposed Documentation Architecture
## Target Table of Contents
### Home
- Home
- Getting started
- FAQ
- Legacy docs
### Tutorials
These should be the default path for most users.
- Tutorial: Run inference on a folder of audio
- Tutorial: Explore predictions in Python for one file
- Tutorial: Train a custom model
- Tutorial: Evaluate a trained model
### How-to Guides
These cover practical tasks once the user is past the happy path.
- How to choose an inference input mode
- How to run batch predictions from a directory
- How to run batch predictions from a file list
- How to run predictions from a dataset config
- How to tune detection thresholds
- How to inspect class scores in Python
- How to inspect detection features in Python
- How to save predictions in different output formats
- How to configure inference clipping
- How to configure audio preprocessing
- How to configure spectrogram preprocessing
- How to configure target definitions
- How to define target classes
- How to configure ROI mapping
- How to configure an AOEF dataset
- How to import legacy BatDetect2 annotations
- How to fine-tune from a checkpoint
- How to choose and configure evaluation tasks
- How to interpret evaluation outputs
### Reference
This should be the complete lookup layer.
- CLI reference
- CLI reference: base command and global options
- CLI reference: predict
- CLI reference: data
- CLI reference: train
- CLI reference: evaluate
- CLI reference: legacy detect
- API reference: `BatDetect2API`
- Config reference: top-level app config
- Config reference: inference config
- Config reference: evaluation config
- Config reference: outputs config
- Config reference: output formats
- Config reference: output transforms
- Config reference: preprocessing config
- Config reference: postprocess config
- Config reference: targets config workflow
- Reference: data sources
- Reference: targets module
### Understanding
This is the conceptual layer and should carry the deeper Diataxis
"understanding" material.
- What BatDetect2 predicts
- How the pipeline fits together
- How to interpret detection scores and class scores
- How to interpret formatted outputs
- What extracted features / embeddings are and are not
- Postprocessing and thresholds
- Preprocessing consistency and domain shift
- Target encoding and decoding
- Evaluation concepts and matching behavior
- Model output, validation, and ecological interpretation
### Legacy
This is a clearly signposted area for the previous workflow only.
- Legacy overview
- Legacy CLI workflow with `batdetect2 detect`
- Legacy Python API with `batdetect2.api`
- Legacy feature extraction outputs
- Migration guide: legacy to current workflows
### Tutorials
Keep tutorials opinionated and minimal. Each one should show the default happy
path with the fewest possible choices.
Planned tutorial set:
1. Run inference on a folder of audio.
2. Explore predictions in Python for one file.
3. Train a custom model.
4. Evaluate a trained model.
### How-to Guides
Use how-to guides for branching tasks and customization.
Planned additions or expansions:
- Choose an inference input mode: directory, file list, or dataset.
- Run large batch inference reproducibly.
- Save predictions in different output formats.
- Inspect class scores and features in Python.
- Explore detection features / embeddings downstream.
- Tune clipping and inference settings.
- Fine-tune from a checkpoint.
- Choose and configure evaluation tasks.
- Interpret evaluation artifacts.
### Reference
Reference should become the complete map of all configurable surfaces.
High-priority additions:
- `BatDetect2API` reference.
- `InferenceConfig` reference.
- `EvaluationConfig` reference.
- `OutputsConfig` and output format reference.
- Output transform reference.
- clearer config composition reference for the full app config.
### Understanding
This is where the deeper conceptual material should live.
High-priority pages:
1. What BatDetect2 predicts.
2. How to interpret outputs, scores, and uncertainty.
3. What extracted features / embeddings are and are not.
4. Targets, labels, and decoded outputs.
5. Preprocessing consistency and domain shift.
6. Postprocessing, thresholds, and output density.
7. How evaluation works and what the metrics mean.
8. Why local validation is required before ecological interpretation.
## Priority Order
### Phase 1: Fix the primary user journey
1. Expand the four scaffold tutorials into real end-to-end guides.
2. Add a proper Python/API inference story.
3. Document outputs and how to inspect them.
4. Align `README.md` with the newer CLI/API documentation story.
5. Create the legacy docs section and add clear signposting to it.
### Phase 2: Cover the customization surface
1. Add how-to guides for batch inference, output formats, and API inspection.
2. Add reference pages for inference, outputs, evaluation, and API surfaces.
3. Add fine-tuning and advanced training guidance.
4. Write the migration guide from legacy to current workflows.
### Phase 3: Deepen understanding
1. Expand the conceptual section into a true understanding section.
2. Add pages for output interpretation, features/embeddings, and evaluation
concepts.
3. Reader-test the docs against realistic user questions.
## Immediate Next Steps
1. Decide whether to rename `explanation/` to `understanding/` or keep the
current directory name and just treat it as the Diataxis understanding
section.
2. Draft the target table of contents for Tutorials, How-to, Reference, and
Understanding.
3. Draft the legacy docs section and migration-guide table of contents.
4. Rewrite the four scaffold tutorials first.
5. Add the missing API, outputs, evaluation, and migration documentation
immediately after.


@ -0,0 +1,39 @@
# Legacy CLI workflow: `batdetect2 detect`
This page documents the previous CLI workflow based on `batdetect2 detect`.
```{warning}
This is legacy documentation.
For new workflows, use `batdetect2 predict directory` instead.
If you are migrating, start with {doc}`migration-guide`.
```
## Legacy command shape
```bash
batdetect2 detect AUDIO_DIR ANN_DIR DETECTION_THRESHOLD
```
Common legacy options included:
- `--cnn_features`
- `--spec_features`
- `--time_expansion_factor`
- `--save_preds_if_empty`
- `--model_path`
## Current replacement
The closest current CLI entry point is:
```bash
batdetect2 predict directory \
    path/to/model.ckpt \
    path/to/audio_dir \
    path/to/outputs
```
## Related pages
- Migration guide: {doc}`migration-guide`
- Current predict docs: {doc}`../reference/cli/predict`


@ -0,0 +1,34 @@
# Legacy feature extraction outputs
The previous BatDetect2 workflow exposed several output concepts that users may still rely on.
These included:
- `cnn_feats`
- `spec_features`
- `spec_slices`
## Why this matters
Users exploring older notebooks or downstream analysis code often encounter these names first.
The current stack exposes a different surface centered on per-detection `features` plus configurable output formatters.
## Migration note
There is not always a strict one-to-one replacement.
When migrating, first identify which part of the old workflow you actually need:
- low-level exported features,
- spectrogram slices,
- model-internal feature vectors,
- legacy JSON output shape.
Then map that need onto the current API and output format configuration.
## Related pages
- Migration guide: {doc}`migration-guide`
- Current features explanation: {doc}`../explanation/extracted-features-and-embeddings`
- Output formats reference: {doc}`../reference/output-formats`


@ -0,0 +1,27 @@
# Legacy documentation
This section documents the previous BatDetect2 workflow.
Use these pages if you need to keep working with the older `batdetect2 detect` command or the older `batdetect2.api` interface.
For new projects, we recommend the current workflow:
- CLI: `batdetect2 predict`
- Python: `batdetect2.api_v2.BatDetect2API`
If you are moving from the older workflow, start with {doc}`migration-guide`.
```{warning}
These pages describe the previous workflow.
They are kept for continuity and migration support.
New users should start with {doc}`../getting_started` and {doc}`../tutorials/index`.
```
```{toctree}
:maxdepth: 1
cli-detect
python-api
feature-extraction
migration-guide
```


@ -0,0 +1,96 @@
# Migration guide: legacy to current workflows
Use this guide when moving from the previous BatDetect2 workflow to the current CLI and API.
## Who should migrate now
You should migrate if:
- you are starting a new workflow,
- you want to follow the current documentation path,
- you want the newer CLI and API surface,
- you are maintaining code that does not depend on the exact legacy JSON or feature outputs.
You may need the legacy workflow a bit longer if:
- downstream tooling depends on the exact old output structure,
- you rely on older notebooks built around `batdetect2.api`,
- you depend on legacy feature extraction outputs without a validated replacement yet.
## CLI mapping
- `batdetect2 detect AUDIO_DIR ANN_DIR DETECTION_THRESHOLD`
-> `batdetect2 predict directory MODEL_PATH AUDIO_DIR OUTPUT_PATH --detection-threshold ...`
Main changes:
- the model path is now a positional argument on the `predict` subcommand,
- the current workflow expects an explicit checkpoint path rather than silently relying on the old default CLI behavior,
- output formatting is configurable,
- threshold override is an option rather than a required positional argument,
- there are separate subcommands for directory, file-list, and dataset-driven inference.
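The mapping above can be sketched as a small argument translator. This is only an illustration, assuming the command shapes shown in this guide; the helper name `legacy_detect_to_predict` is hypothetical, and reusing the legacy `ANN_DIR` as the new output path is an assumption you should review for your own setup.

```python
from typing import List, Optional


def legacy_detect_to_predict(
    legacy_args: List[str],
    model_path: str,
    output_path: Optional[str] = None,
) -> List[str]:
    """Translate a legacy `detect` argv into the `predict directory` shape."""
    if legacy_args[:2] != ["batdetect2", "detect"]:
        raise ValueError("expected a legacy `batdetect2 detect` command")

    rest = legacy_args[2:]
    if any(arg.startswith("--") for arg in rest):
        # Legacy options have no automatic mapping in this sketch.
        raise ValueError("legacy options need manual review")
    if len(rest) != 3:
        raise ValueError("expected AUDIO_DIR ANN_DIR DETECTION_THRESHOLD")

    audio_dir, ann_dir, threshold = rest
    float(threshold)  # fail early if the threshold is not numeric

    return [
        "batdetect2", "predict", "directory",
        model_path,                # now an explicit positional argument
        audio_dir,
        output_path or ann_dir,    # assumption: reuse the old output folder
        "--detection-threshold", threshold,
    ]
```

The translator deliberately refuses commands that use legacy options such as `--cnn_features`, because those need a case-by-case decision rather than a mechanical rewrite.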
## Python API mapping
- old: `import batdetect2.api as api`
- current: `from batdetect2.api_v2 import BatDetect2API`
Typical migration shape:
```python
from pathlib import Path

from batdetect2.api_v2 import BatDetect2API

# Load the model from an explicit checkpoint path.
api = BatDetect2API.from_checkpoint(Path("path/to/model.ckpt"))
# Run inference on a single audio file.
prediction = api.process_file(Path("path/to/audio.wav"))
```
Useful replacements:
- legacy `process_file` -> current `BatDetect2API.process_file`
- legacy `process_audio` -> current `BatDetect2API.process_audio`
- legacy `process_spectrogram` -> current `BatDetect2API.process_spectrogram`
- legacy one-off batch loops -> current `process_files` or CLI `predict`
## Output and terminology changes
Legacy workflows often centered on:
- BatDetect2-style JSON output,
- `cnn_feats`,
- `spec_features`,
- `spec_slices`.
Current workflows center on:
- `ClipDetections` and `Detection` objects,
- per-detection `detection_score`,
- per-detection `class_scores`,
- per-detection `features`,
- configurable output formatters.
## What to validate after migration
Before replacing a legacy workflow in production or research analysis, validate:
- that thresholds are still appropriate,
- that outputs are being saved in the right format,
- that downstream code reads the new outputs correctly,
- that feature-related assumptions still hold,
- that you assume evaluation results and ecological interpretation are unchanged only where you have actually verified it.
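One simple way to start this validation is to compare per-file detection counts between a legacy run and a current run on the same reviewed subset. The sketch below assumes you have already reduced each run to a mapping from file name to detection count; how you extract those counts depends on your output formats, and the function name and the 10% tolerance are hypothetical choices.

```python
from typing import Dict, List, Tuple


def compare_detection_counts(
    legacy: Dict[str, int],
    current: Dict[str, int],
    tolerance: float = 0.1,
) -> List[Tuple[str, int, int]]:
    """Return (file, legacy_count, current_count) for files that differ
    by more than `tolerance` relative to the larger of the two counts."""
    flagged = []
    for name in sorted(set(legacy) | set(current)):
        old = legacy.get(name, 0)
        new = current.get(name, 0)
        # Use at least 1 as the scale so empty files are still comparable.
        allowed = tolerance * max(old, new, 1)
        if abs(old - new) > allowed:
            flagged.append((name, old, new))
    return flagged
```

Matching counts do not prove that the detections themselves agree, so treat flagged files as a starting point for manual review, not as a pass/fail result.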
## Migration checklist
1. Identify the old entry points you use.
2. Replace them with the current CLI or `BatDetect2API` equivalents.
3. Choose an output format explicitly.
4. Re-run on a small reviewed subset.
5. Compare outputs and downstream behavior.
6. Update any notebooks or scripts that assume legacy field names.
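Steps 1 and 6 can be partly automated by scanning scripts for legacy names. The following is a rough sketch using only the names mentioned in this guide; the function `find_legacy_usage` and its labels are hypothetical, and a text search like this will miss indirect usage such as aliased imports.

```python
import re
from typing import List, Tuple

# Patterns built from the legacy names listed in this guide.
LEGACY_PATTERNS = {
    "legacy import": re.compile(
        r"\bimport\s+batdetect2\.api\b"
        r"|\bfrom\s+batdetect2\s+import\s+api\b"
        r"|\bfrom\s+batdetect2\.api\s+import\b"
    ),
    "legacy feature name": re.compile(r"\b(cnn_feats|spec_features|spec_slices)\b"),
    "legacy CLI command": re.compile(r"\bbatdetect2\s+detect\b"),
}


def find_legacy_usage(source: str) -> List[Tuple[int, str, str]]:
    """Return (line number, label, stripped line) for each legacy match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in LEGACY_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label, line.strip()))
    return hits
```

The word boundaries keep `batdetect2.api_v2` imports from being reported, so the scan can be re-run after migration to confirm that only current entry points remain.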
## Related pages
- Current getting started: {doc}`../getting_started`
- Current tutorials: {doc}`../tutorials/index`
- Current API reference: {doc}`../reference/api`


@ -0,0 +1,40 @@
# Legacy Python API: `batdetect2.api`
This page documents the previous Python API workflow based on `batdetect2.api`.
```{warning}
This is legacy documentation.
For new workflows, use `batdetect2.api_v2.BatDetect2API`.
If you are migrating, start with {doc}`migration-guide`.
```
## Legacy entry points
Common legacy functions included:
- `process_file`
- `process_audio`
- `process_spectrogram`
- `load_audio`
- `generate_spectrogram`
- `postprocess`
The legacy API also exposed the default model and default config more directly.
## Current replacement
The current Python path is:
```python
from pathlib import Path

from batdetect2.api_v2 import BatDetect2API

# Load the model from an explicit checkpoint path.
api = BatDetect2API.from_checkpoint(Path("path/to/model.ckpt"))
# Run inference on a single audio file.
prediction = api.process_file(Path("path/to/audio.wav"))
```
## Related pages
- Migration guide: {doc}`migration-guide`
- Current API reference: {doc}`../reference/api`