From 300716895efa3f2f484696b21affc082afa0788a Mon Sep 17 00:00:00 2001
From: mbsantiago <santiago.mbal@gmail.com>
Date: Thu, 30 Apr 2026 11:48:19 +0100
Subject: [PATCH] docs: add task guides and API/config references

---
 .../evaluation-concepts-and-matching.md       | 48 ++++++++++++++
 .../extracted-features-and-embeddings.md      | 36 ++++++++++
 .../interpreting-formatted-outputs.md         | 36 ++++++++++
 .../explanation/what-batdetect2-predicts.md   | 45 +++++++++++++
 .../how_to/choose-an-inference-input-mode.md  | 66 +++++++++++++++++++
 .../choose-and-configure-evaluation-tasks.md  | 66 +++++++++++++++++++
 .../how_to/fine-tune-from-a-checkpoint.md     | 45 +++++++++++++
 .../how_to/inspect-class-scores-in-python.md  | 44 +++++++++++++
 .../inspect-detection-features-in-python.md   | 49 ++++++++++++++
 .../how_to/interpret-evaluation-outputs.md    | 41 ++++++++++++
 docs/source/how_to/run-batch-predictions.md   | 31 ++++++++-
 ...predictions-in-different-output-formats.md | 64 ++++++++++++++++++
 .../source/how_to/tune-detection-threshold.md | 15 +++++
 docs/source/how_to/tune-inference-clipping.md | 63 ++++++++++++++++++
 docs/source/reference/api.md                  | 65 ++++++++++++++++++
 docs/source/reference/app-config.md           | 38 +++++++++++
 docs/source/reference/evaluation-config.md    | 46 +++++++++++++
 docs/source/reference/inference-config.md     | 41 ++++++++++++
 docs/source/reference/output-formats.md       | 63 ++++++++++++++++++
 docs/source/reference/output-transforms.md    | 37 +++++++++++
 docs/source/reference/outputs-config.md       | 33 ++++++++++
 21 files changed, 971 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/explanation/evaluation-concepts-and-matching.md
 create mode 100644 docs/source/explanation/extracted-features-and-embeddings.md
 create mode 100644 docs/source/explanation/interpreting-formatted-outputs.md
 create mode 100644 docs/source/explanation/what-batdetect2-predicts.md
 create mode 100644 docs/source/how_to/choose-an-inference-input-mode.md
 create mode 100644 docs/source/how_to/choose-and-configure-evaluation-tasks.md
 create mode 100644 docs/source/how_to/fine-tune-from-a-checkpoint.md
 create mode 100644 docs/source/how_to/inspect-class-scores-in-python.md
 create mode 100644 docs/source/how_to/inspect-detection-features-in-python.md
 create mode 100644 docs/source/how_to/interpret-evaluation-outputs.md
 create mode 100644 docs/source/how_to/save-predictions-in-different-output-formats.md
 create mode 100644 docs/source/how_to/tune-inference-clipping.md
 create mode 100644 docs/source/reference/api.md
 create mode 100644 docs/source/reference/app-config.md
 create mode 100644 docs/source/reference/evaluation-config.md
 create mode 100644 docs/source/reference/inference-config.md
 create mode 100644 docs/source/reference/output-formats.md
 create mode 100644 docs/source/reference/output-transforms.md
 create mode 100644 docs/source/reference/outputs-config.md

diff --git a/docs/source/explanation/evaluation-concepts-and-matching.md b/docs/source/explanation/evaluation-concepts-and-matching.md
new file mode 100644
index 0000000..96563ee
--- /dev/null
+++ b/docs/source/explanation/evaluation-concepts-and-matching.md
@@ -0,0 +1,48 @@
+# Evaluation concepts and matching
+
+Evaluation is not just "run predictions and compute one number".
+
+The reported metric depends on the evaluation task, the matching rule, and the treatment of clip boundaries and generic labels.
+
+## Task families answer different questions
+
+Built-in task families include:
+
+- sound event detection,
+- sound event classification,
+- top-class detection,
+- clip detection,
+- clip classification.
+
+Choose the task that matches the scientific or engineering question.
+
+## Matching matters
+
+For sound-event-style tasks, predictions and annotations are matched using an affinity function.
+
+Important controls include:
+
+- `affinity`,
+- `affinity_threshold`,
+- `strict_match`,
+- `ignore_start_end`.
+
+Small changes here can change the reported metric without changing the underlying predictions.
+
+## Boundary handling matters
+
+The evaluation base task can exclude events near clip boundaries through `ignore_start_end`.
+
+This is useful when clip boundaries make matches ambiguous.
+
+## Generic labels can matter in classification
+
+Classification tasks can include or exclude generic targets depending on configuration.
+
+That affects what counts as a valid class-level comparison.
+
+## Related pages
+
+- Evaluate on a test set: {doc}`../tutorials/evaluate-on-a-test-set`
+- Evaluation config reference: {doc}`../reference/evaluation-config`
+- Model output and validation: {doc}`model-output-and-validation`
diff --git a/docs/source/explanation/extracted-features-and-embeddings.md b/docs/source/explanation/extracted-features-and-embeddings.md
new file mode 100644
index 0000000..d2ea44b
--- /dev/null
+++ b/docs/source/explanation/extracted-features-and-embeddings.md
@@ -0,0 +1,36 @@
+# Extracted features and embeddings
+
+The current API exposes a per-detection `features` vector.
+
+Older BatDetect2 workflows also exposed concepts such as `cnn_feats`, `spec_features`, and `spec_slices`.
+
+## What the current feature vector is
+
+In the current stack, each retained detection can carry an internal feature representation produced by the model output pipeline.
+
+This is useful for downstream exploration, comparison, and custom analysis.
+
+## What these features are not
+
+They are not automatically human-interpretable ecological variables.
+
+They are also not a substitute for careful validation.
+
+## Why people refer to them as embeddings
+
+In practice, users often treat these feature vectors as embeddings because they can be used as dense learned representations of detections.
+
+That usage is reasonable, but you should still treat them as model-derived internal representations whose meaning depends on the training setup.
+
+## Legacy terminology versus current terminology
+
+- legacy `cnn_feats` referred to CNN feature outputs in the older workflow,
+- legacy `spec_features` referred to lower-level extracted call features,
+- current `features` are the per-detection vectors attached to `Detection` objects.
+
+These are related ideas, but not necessarily one-to-one replacements.
+
+## Related pages
+
+- Inspect detection features in Python: {doc}`../how_to/inspect-detection-features-in-python`
+- Legacy feature extraction: {doc}`../legacy/feature-extraction`
diff --git a/docs/source/explanation/interpreting-formatted-outputs.md b/docs/source/explanation/interpreting-formatted-outputs.md
new file mode 100644
index 0000000..5bd6d98
--- /dev/null
+++ b/docs/source/explanation/interpreting-formatted-outputs.md
@@ -0,0 +1,36 @@
+# Interpreting formatted outputs
+
+BatDetect2 can write predictions in several output formats.
+
+Those formats are different views of the same underlying detections, not different model behaviors.
+
+## Separate the underlying detection from the serialized file
+
+Internally, the current stack works with clip-level detections containing geometry, detection score, class scores, and features.
+
+Output formatters then serialize those detections in different ways.
+
+## Raw outputs are richest
+
+The `raw` format preserves the broadest structured view of detections and is a good default when you want to inspect or reload predictions later.
+
+## Tabular outputs are for analysis convenience
+
+The `parquet` format is convenient for data analysis workflows, but the tabular representation is only one projection of the underlying detection object.
+
+## Legacy-shaped outputs are mainly for compatibility
+
+The `batdetect2` formatter writes the older BatDetect2-style JSON shape.
+
+Use it when you need compatibility with older downstream tools or workflows.
+
+## The meaning does not come from the file extension
+
+Do not assume that a `.json`, `.parquet`, or `.nc` file changes what the model predicted.
+
+It changes how the prediction is packaged and how much detail is retained.
+
+## Related pages
+
+- Output formats reference: {doc}`../reference/output-formats`
+- Outputs config reference: {doc}`../reference/outputs-config`
diff --git a/docs/source/explanation/what-batdetect2-predicts.md b/docs/source/explanation/what-batdetect2-predicts.md
new file mode 100644
index 0000000..8ed4568
--- /dev/null
+++ b/docs/source/explanation/what-batdetect2-predicts.md
@@ -0,0 +1,45 @@
+# What BatDetect2 predicts
+
+BatDetect2 predicts call-level events, not recording-level truth.
+
+For each retained detection, the current stack can expose:
+
+- a geometry describing where the event sits in time-frequency space,
+- a detection score,
+- a class-score vector,
+- an internal feature vector.
+
+## Detection score versus class scores
+
+These are different outputs and should not be interpreted as the same thing.
+
+- The detection score is about whether the event is kept as a detection.
+- The class-score vector ranks classes for that detected event.
+
+A detection can be kept while still having uncertain class identity.
+
+## Predictions are conditional on the workflow
+
+The final output also depends on:
+
+- preprocessing,
+- postprocessing,
+- thresholds,
+- target definitions,
+- output transforms.
+
+That is why two runs can differ even when they use the same checkpoint.
+
+## What BatDetect2 does not predict
+
+BatDetect2 does not directly output ecological truth.
+
+It also does not eliminate the need for local validation.
+
+Use reviewed local data before making ecological claims.
+
+## Related pages
+
+- Model output and validation: {doc}`model-output-and-validation`
+- Postprocessing and thresholds: {doc}`postprocessing-and-thresholds`
+- Interpreting formatted outputs: {doc}`interpreting-formatted-outputs`
diff --git a/docs/source/how_to/choose-an-inference-input-mode.md b/docs/source/how_to/choose-an-inference-input-mode.md
new file mode 100644
index 0000000..ec89a72
--- /dev/null
+++ b/docs/source/how_to/choose-an-inference-input-mode.md
@@ -0,0 +1,66 @@
+# How to choose an inference input mode
+
+Use this guide to decide whether `predict directory`, `predict file_list`, or `predict dataset` is the right entry point for your run.
+
+## Use `predict directory` when the recordings already live together
+
+This is the simplest choice.
+
+Use it when:
+
+- your recordings are already organized in one directory tree,
+- you want BatDetect2 to discover audio files for you,
+- you are doing a first pass over a folder of recordings.
+
+```bash
+batdetect2 predict directory \
+  path/to/model.ckpt \
+  path/to/audio_dir \
+  path/to/outputs
+```
+
+## Use `predict file_list` when you need explicit control over the file set
+
+Use it when:
+
+- you want to run only a selected subset,
+- your files are spread across directories,
+- another tool has already produced the exact list of recordings to process.
+
+The list file should contain one path per line.
+
+```bash
+batdetect2 predict file_list \
+  path/to/model.ckpt \
+  path/to/audio_files.txt \
+  path/to/outputs
+```
+
+## Use `predict dataset` when your workflow is already annotation-set driven
+
+Use it when:
+
+- your project already has a `soundevent` annotation set,
+- you want prediction runs aligned with that annotation metadata,
+- you want BatDetect2 to resolve recording paths from the annotation set.
+
+```bash
+batdetect2 predict dataset \
+  path/to/model.ckpt \
+  path/to/annotation_set.json \
+  path/to/outputs
+```
+
+The dataset command reads a `soundevent` annotation set and extracts unique recording paths before inference.
+
+## Rule of thumb
+
+- Start with `directory` for the easiest first run.
+- Use `file_list` when selection matters.
+- Use `dataset` when the rest of your workflow is already dataset-based.
+
+## Related pages
+
+- Run batch predictions: {doc}`run-batch-predictions`
+- Tune inference clipping: {doc}`tune-inference-clipping`
+- Predict command reference: {doc}`../reference/cli/predict`
diff --git a/docs/source/how_to/choose-and-configure-evaluation-tasks.md b/docs/source/how_to/choose-and-configure-evaluation-tasks.md
new file mode 100644
index 0000000..17dd7af
--- /dev/null
+++ b/docs/source/how_to/choose-and-configure-evaluation-tasks.md
@@ -0,0 +1,66 @@
+# How to choose and configure evaluation tasks
+
+Use this guide when the default evaluation tasks do not match the question you want to answer.
+
+## Know the default first
+
+By default, BatDetect2 evaluation starts with:
+
+- sound event detection,
+- sound event classification.
+
+Those are good defaults for many projects, but not for all of them.
+
+## Choose the task that matches the question
+
+Common built-in task families include:
+
+- `sound_event_detection`
+- `sound_event_classification`
+- `top_class_detection`
+- `clip_detection`
+- `clip_classification`
+
+Choose based on the question you care about.
+
+- Use sound-event tasks when you care about individual call events.
+- Use clip tasks when you care about clip-level presence or clip-level class evidence.
+- Use top-class detection when you want matching based on the highest-scoring class per detection.
+
+## Configure tasks in `EvaluationConfig`
+
+Example:
+
+```yaml
+tasks:
+  - name: sound_event_detection
+    prefix: detection
+    affinity_threshold: 0.0
+    strict_match: true
+  - name: clip_classification
+    prefix: clip_classification
+```
+
+Pass the config with:
+
+```bash
+batdetect2 evaluate \
+  path/to/model.ckpt \
+  path/to/test_dataset.yaml \
+  --base-dir path/to/project_root \
+  --evaluation-config path/to/evaluation.yaml
+```
+
+Include `--base-dir` when the dataset config resolves recordings through relative paths.
+
+## Change one thing at a time
+
+When comparing models or settings, avoid changing task definitions, thresholds, matching behavior, and datasets all at once.
+
+Otherwise it becomes hard to explain why the metric changed.
+
+## Related pages
+
+- Evaluation tutorial: {doc}`../tutorials/evaluate-on-a-test-set`
+- Evaluation config reference: {doc}`../reference/evaluation-config`
+- Evaluation concepts: {doc}`../explanation/evaluation-concepts-and-matching`
diff --git a/docs/source/how_to/fine-tune-from-a-checkpoint.md b/docs/source/how_to/fine-tune-from-a-checkpoint.md
new file mode 100644
index 0000000..59fb2a6
--- /dev/null
+++ b/docs/source/how_to/fine-tune-from-a-checkpoint.md
@@ -0,0 +1,45 @@
+# How to fine-tune from a checkpoint
+
+Use this guide when you want to continue training from an existing checkpoint instead of training a fresh model from a model config.
+
+## Use `--model` for checkpoint-based training
+
+Pass a checkpoint with `--model`.
+
+Do not combine `--model` with `--model-config`.
+
+```bash
+batdetect2 train \
+  path/to/train_dataset.yaml \
+  --val-dataset path/to/val_dataset.yaml \
+  --model path/to/model.ckpt \
+  --training-config path/to/training.yaml
+```
+
+## Keep targets and preprocessing aligned
+
+If you override targets or audio-related settings while fine-tuning, validate that they still match the checkpoint and your dataset.
+
+Mismatches here can produce confusing failures or invalid comparisons.
+
+## Decide what question the fine-tune should answer
+
+Common fine-tuning goals are:
+
+- adapting to local recording conditions,
+- adapting to a new label set,
+- improving performance on a narrower deployment context.
+
+Make that goal explicit before comparing results.
+
+## Evaluate after fine-tuning
+
+Always evaluate the fine-tuned checkpoint on a held-out dataset and compare it against the checkpoint you started from.
+
+Use the same evaluation setup when comparing before and after.
+
+## Related pages
+
+- Training tutorial: {doc}`../tutorials/train-a-custom-model`
+- Evaluate on a test set: {doc}`../tutorials/evaluate-on-a-test-set`
+- Train command reference: {doc}`../reference/cli/train`
diff --git a/docs/source/how_to/inspect-class-scores-in-python.md b/docs/source/how_to/inspect-class-scores-in-python.md
new file mode 100644
index 0000000..85ef664
--- /dev/null
+++ b/docs/source/how_to/inspect-class-scores-in-python.md
@@ -0,0 +1,44 @@
+# How to inspect class scores in Python
+
+Use this guide when you need more than the top class label for each detection.
+
+## Get the ranked class scores
+
+`BatDetect2API.get_class_scores` returns `(class_name, score)` pairs for one detection.
+
+```python
+from pathlib import Path
+
+from batdetect2.api_v2 import BatDetect2API
+
+api = BatDetect2API.from_checkpoint(Path("path/to/model.ckpt"))
+prediction = api.process_file(Path("path/to/audio.wav"))
+
+for detection in prediction.detections:
+    print("detection score:", detection.detection_score)
+    for class_name, score in api.get_class_scores(detection):
+        print(class_name, score)
+```
+
+## Separate detection confidence from class ranking
+
+Keep these two ideas separate:
+
+- `detection_score` tells you how confident the model is that the event should be kept as a detection,
+- `class_scores` tell you how the model ranked classes for that detected event.
+
+A detection can have a reasonable detection score while still having uncertain class ranking.
+
+## Hide the top class if needed
+
+If you want to inspect only the alternatives, pass `include_top_class=False`.
+
+```python
+api.get_class_scores(detection, include_top_class=False)
+```
+
+## Related pages
+
+- Python tutorial: {doc}`../tutorials/integrate-with-a-python-pipeline`
+- API reference: {doc}`../reference/api`
+- Understanding scores: {doc}`../explanation/what-batdetect2-predicts`
diff --git a/docs/source/how_to/inspect-detection-features-in-python.md b/docs/source/how_to/inspect-detection-features-in-python.md
new file mode 100644
index 0000000..72c22f5
--- /dev/null
+++ b/docs/source/how_to/inspect-detection-features-in-python.md
@@ -0,0 +1,49 @@
+# How to inspect detection features in Python
+
+Use this guide when you want the per-detection feature vectors exposed by the current API.
+
+## Get the feature vector for one detection
+
+Each detection carries a `features` vector.
+
+The API exposes it through `get_detection_features`.
+
+```python
+from pathlib import Path
+
+from batdetect2.api_v2 import BatDetect2API
+
+api = BatDetect2API.from_checkpoint(Path("path/to/model.ckpt"))
+prediction = api.process_file(Path("path/to/audio.wav"))
+
+for detection in prediction.detections:
+    features = api.get_detection_features(detection)
+    print(features.shape)
+```
+
+## Use features for exploration, not as ground truth labels
+
+These features are internal model representations attached to detections.
+
+They can be useful for:
+
+- exploratory visualization,
+- downstream clustering,
+- comparison across detections,
+- building extra analysis pipelines.
+
+They do not replace validation.
+
+They also do not automatically have a one-to-one interpretation as ecological variables.
+
+## Save predictions with features included
+
+If you need features on disk, use an output format that supports them, such as `raw` or `parquet`, and keep feature inclusion enabled.
+
+See {doc}`save-predictions-in-different-output-formats`.
+
+## Related pages
+
+- Understanding features and embeddings: {doc}`../explanation/extracted-features-and-embeddings`
+- Output formats reference: {doc}`../reference/output-formats`
+- API reference: {doc}`../reference/api`
diff --git a/docs/source/how_to/interpret-evaluation-outputs.md b/docs/source/how_to/interpret-evaluation-outputs.md
new file mode 100644
index 0000000..f5556c0
--- /dev/null
+++ b/docs/source/how_to/interpret-evaluation-outputs.md
@@ -0,0 +1,41 @@
+# How to interpret evaluation outputs
+
+Use this guide after `batdetect2 evaluate` has written metrics and plots to disk.
+
+## Start by identifying the task
+
+Do not interpret a metric until you know which evaluation task produced it.
+
+For example, a sound-event detection metric and a clip-classification metric answer different questions.
+
+## Read the output directory as a bundle
+
+Treat the evaluation output directory as one package:
+
+- metrics,
+- plots,
+- saved predictions,
+- config context.
+
+Do not lift a single number out of context and treat it as the whole story.
+
+## Look for failure patterns, not just overall averages
+
+Check:
+
+- whether errors concentrate in certain taxa,
+- whether specific sites or recorder setups behave differently,
+- whether threshold choices are driving the result,
+- whether predictions are near clip boundaries or matching thresholds.
+
+## Keep validation and deployment questions separate
+
+A model can look good on one task and still be a poor fit for your deployment question.
+
+Interpret the outputs in relation to the real use case, not only the easiest metric to report.
+
+## Related pages
+
+- Evaluation tutorial: {doc}`../tutorials/evaluate-on-a-test-set`
+- Evaluation concepts: {doc}`../explanation/evaluation-concepts-and-matching`
+- Model output and validation: {doc}`../explanation/model-output-and-validation`
diff --git a/docs/source/how_to/run-batch-predictions.md b/docs/source/how_to/run-batch-predictions.md
index 4af7826..5d2d68c 100644
--- a/docs/source/how_to/run-batch-predictions.md
+++ b/docs/source/how_to/run-batch-predictions.md
@@ -3,6 +3,8 @@
 This guide shows practical command patterns for directory-based and file-list
 prediction runs.
 
+Use it once you know which input mode you want and need concrete command templates for a repeatable batch run.
+
 ## Predict from a directory
 
 ```bash
@@ -12,6 +14,8 @@ batdetect2 predict directory \
   path/to/outputs
 ```
 
+Use this when BatDetect2 should discover the audio files for you.
+
 ## Predict from a file list
 
 ```bash
@@ -21,10 +25,35 @@ batdetect2 predict file_list \
   path/to/outputs
 ```
 
+Use this when another part of your workflow already produced the exact recording list to process.
+
+## Predict from an annotation set
+
+```bash
+batdetect2 predict dataset \
+  path/to/model.ckpt \
+  path/to/annotation_set.json \
+  path/to/outputs
+```
+
+Use this when your project already has a `soundevent` annotation set and you want BatDetect2 to extract the unique recording paths from it.
+
 ## Useful options
 
 - `--batch-size` to control throughput.
 - `--workers` to set data-loading parallelism.
 - `--format` to select output format.
+- `--inference-config` to control clipping and loader behavior.
+- `--outputs-config` to control serialization and output transforms.
+- `--detection-threshold` to override the detection threshold for a run.
 
-For complete option details, see {doc}`../reference/cli/index`.
+## Practical workflow
+
+For large runs:
+
+1. test the command on a small reviewed subset,
+2. lock the config files and command shape,
+3. write outputs to a dedicated directory per run,
+4. record the checkpoint, config paths, and thresholds used.
+
+For complete option details, see {doc}`../reference/cli/predict`.
diff --git a/docs/source/how_to/save-predictions-in-different-output-formats.md b/docs/source/how_to/save-predictions-in-different-output-formats.md
new file mode 100644
index 0000000..c354243
--- /dev/null
+++ b/docs/source/how_to/save-predictions-in-different-output-formats.md
@@ -0,0 +1,64 @@
+# How to save predictions in different output formats
+
+Use this guide when you need BatDetect2 outputs in a specific representation for downstream tools.
+
+## Choose the format that matches the job
+
+Current built-in output formats include:
+
+- `raw`: one NetCDF file per clip, best for rich structured outputs,
+- `parquet`: tabular storage for data analysis workflows,
+- `soundevent`: prediction-set JSON for soundevent-style tooling,
+- `batdetect2`: legacy per-recording JSON output.
+
+## Select a format from the CLI
+
+Use `--format` for quick experiments.
+
+```bash
+batdetect2 predict directory \
+  path/to/model.ckpt \
+  path/to/audio_dir \
+  path/to/outputs \
+  --format parquet
+```
+
+## Use an outputs config for repeatable runs
+
+Use an outputs config when you want reproducible control over format and transforms.
+
+Example:
+
+```yaml
+format:
+  name: raw
+  include_class_scores: true
+  include_features: true
+  include_geometry: true
+transform:
+  detection_transforms: []
+  clip_transforms: []
+```
+
+Run with:
+
+```bash
+batdetect2 predict directory \
+  path/to/model.ckpt \
+  path/to/audio_dir \
+  path/to/outputs \
+  --outputs-config path/to/outputs.yaml
+```
+
+## Pick the simplest useful format
+
+- Use `raw` if you want the richest output surface and easy round-tripping.
+- Use `parquet` if you want tabular analysis in Python or data-lake workflows.
+- Use `soundevent` if you want prediction-set JSON.
+- Use `batdetect2` only when you need the legacy JSON shape.
+
+## Related pages
+
+- Outputs config reference: {doc}`../reference/outputs-config`
+- Output formats reference: {doc}`../reference/output-formats`
+- Output transforms reference: {doc}`../reference/output-transforms`
diff --git a/docs/source/how_to/tune-detection-threshold.md b/docs/source/how_to/tune-detection-threshold.md
index da21d3e..2229c9c 100644
--- a/docs/source/how_to/tune-detection-threshold.md
+++ b/docs/source/how_to/tune-detection-threshold.md
@@ -2,6 +2,10 @@
 
 Use this guide to compare detection outputs at different threshold values.
 
+The goal is not to find a universal threshold.
+
+The goal is to choose a threshold that fits your reviewed local data and the project trade-off between missed calls and false positives.
+
 ## 1) Start with a baseline run
 
 Run an initial prediction workflow and keep outputs in a dedicated folder.
@@ -20,11 +24,22 @@ batdetect2 predict directory \
   --detection-threshold 0.3
 ```
 
+Keep each threshold run in a separate output directory.
+
+That makes it easier to compare counts and inspect example files without mixing results.
+
 ## 3) Validate against known calls
 
 Use files with trusted annotations or expert review to select a threshold that
 fits your project goals.
 
+Check both:
+
+- obvious false positives,
+- obvious missed calls.
+
+If class interpretation matters downstream, inspect class ranking behavior as well, not just detection counts.
+
 ## 4) Record your chosen setting
 
 Write down the chosen threshold and rationale so analyses are reproducible.
diff --git a/docs/source/how_to/tune-inference-clipping.md b/docs/source/how_to/tune-inference-clipping.md
new file mode 100644
index 0000000..3e3d164
--- /dev/null
+++ b/docs/source/how_to/tune-inference-clipping.md
@@ -0,0 +1,63 @@
+# How to tune inference clipping
+
+Use this guide when long recordings need to be split into smaller clips during inference.
+
+## What clipping controls
+
+`InferenceConfig.clipping` controls how recordings are split before batching.
+
+Key fields are:
+
+- `duration`: clip duration in seconds,
+- `overlap`: overlap between adjacent clips in seconds,
+- `max_empty`: how much empty padding is allowed,
+- `discard_empty`: whether empty clips are dropped.
+
+## Start from the defaults
+
+Use the built-in clipping behavior first unless you already know you need something else.
+
+Only tune clipping when:
+
+- recordings are much longer than your normal working set,
+- you are seeing edge effects around calls,
+- you need tighter control over throughput or padding behavior.
+
+## Override clipping with an inference config
+
+Create an inference config file and pass it to `predict` or `evaluate`.
+
+Example:
+
+```yaml
+clipping:
+  enabled: true
+  duration: 0.5
+  overlap: 0.1
+  max_empty: 0.0
+  discard_empty: true
+loader:
+  batch_size: 8
+```
+
+Run with:
+
+```bash
+batdetect2 predict directory \
+  path/to/model.ckpt \
+  path/to/audio_dir \
+  path/to/outputs \
+  --inference-config path/to/inference.yaml
+```
+
+## Validate clipping changes on a small reviewed subset
+
+Changing clipping changes what the model sees in each clip and can change how events near clip boundaries behave.
+
+Check a reviewed subset before applying clipping changes to a full project.
+
+## Related pages
+
+- Inference config reference: {doc}`../reference/inference-config`
+- Run batch predictions: {doc}`run-batch-predictions`
+- Understanding the pipeline: {doc}`../explanation/pipeline-overview`
diff --git a/docs/source/reference/api.md b/docs/source/reference/api.md
new file mode 100644
index 0000000..d514bce
--- /dev/null
+++ b/docs/source/reference/api.md
@@ -0,0 +1,65 @@
+# `BatDetect2API` reference
+
+`BatDetect2API` is the main entry point for the current Python workflow.
+
+It wraps model loading, inference, evaluation, output formatting, and training-related entry points behind one object.
+
+Defined in `batdetect2.api_v2`.
+
+## Create an API instance
+
+- `BatDetect2API.from_checkpoint(path, ...)`
+  - load a trained checkpoint and optional config overrides.
+- `BatDetect2API.from_config(config)`
+  - build a full stack from a `BatDetect2Config` object.
+
+## Inference methods
+
+- `process_file(audio_file, ...)`
+  - run inference for one recording.
+- `process_files(audio_files, ...)`
+  - run batch inference across a sequence of file paths.
+- `process_directory(audio_dir, ...)`
+  - run inference across the audio files found in one directory.
+- `process_clips(clips, ...)`
+  - run inference on an explicit sequence of clip objects.
+- `process_audio(audio, ...)`
+  - run inference starting from a waveform array.
+- `process_spectrogram(spec, ...)`
+  - run inference starting from a spectrogram tensor.
+
+## Prediction inspection helpers
+
+- `get_top_class_name(detection)`
+  - return the highest-scoring class name for one detection.
+- `get_class_scores(detection, include_top_class=True, sort_descending=True)`
+  - return ranked `(class_name, score)` pairs.
+- `get_detection_features(detection)`
+  - return the per-detection feature vector.
+
+## Audio loading helpers
+
+- `load_audio(path)`
+- `load_recording(recording)`
+- `load_clip(clip)`
+- `generate_spectrogram(audio)`
+
+## Output persistence helpers
+
+- `save_predictions(predictions, path, audio_dir=None, format=None, config=None)`
+- `load_predictions(path, format=None, config=None)`
+
+Use these when you want to save programmatic predictions without going through the CLI.
+
+## Training and evaluation entry points
+
+- `train(...)`
+- `finetune(...)`
+- `evaluate(...)`
+- `evaluate_predictions(...)`
+
+## Related pages
+
+- Python tutorial: {doc}`../tutorials/integrate-with-a-python-pipeline`
+- Outputs config reference: {doc}`outputs-config`
+- Output formats reference: {doc}`output-formats`
diff --git a/docs/source/reference/app-config.md b/docs/source/reference/app-config.md
new file mode 100644
index 0000000..1237c0f
--- /dev/null
+++ b/docs/source/reference/app-config.md
@@ -0,0 +1,38 @@
+# Top-level app config reference
+
+The top-level config object is `BatDetect2Config`.
+
+Defined in `batdetect2.config`.
+
+It combines the main configuration surfaces used across training, inference, evaluation, outputs, and logging.
+
+## Fields
+
+- `config_version`
+- `train`
+  - training-specific config.
+- `evaluation`
+  - evaluation task and plot config.
+- `model`
+  - model architecture, preprocessing, postprocessing, and targets.
+- `audio`
+  - audio loading and resampling config.
+- `inference`
+  - clipping and loader config for prediction-time workflows.
+- `outputs`
+  - output format and output transform config.
+- `logging`
+  - logging backend and formatting config.
+
+## Mental model
+
+Think of `BatDetect2Config` as the complete application wiring for the current stack.
+
+Use it when you want one reproducible config that describes the whole workflow.
+
+## Related pages
+
+- Inference config: {doc}`inference-config`
+- Evaluation config: {doc}`evaluation-config`
+- Outputs config: {doc}`outputs-config`
+- General config reference: {doc}`configs`
diff --git a/docs/source/reference/evaluation-config.md b/docs/source/reference/evaluation-config.md
new file mode 100644
index 0000000..a79afed
--- /dev/null
+++ b/docs/source/reference/evaluation-config.md
@@ -0,0 +1,46 @@
+# Evaluation config reference
+
+`EvaluationConfig` defines which evaluation tasks run and which plots they generate.
+
+Defined in `batdetect2.evaluate.config`.
+
+## Top-level fields
+
+- `tasks`
+  - list of task configs.
+
+## Built-in task families
+
+Current built-in tasks include:
+
+- `sound_event_detection`
+- `sound_event_classification`
+- `top_class_detection`
+- `clip_detection`
+- `clip_classification`
+
+## Shared task controls
+
+Common task-level controls include:
+
+- `prefix`
+- `ignore_start_end`
+
+Sound-event-style tasks also support:
+
+- `affinity`
+- `affinity_threshold`
+- `strict_match`
+
+## Default behavior
+
+The default evaluation config starts with:
+
+- sound event detection,
+- sound event classification.
+
+## Related pages
+
+- Choose and configure evaluation tasks: {doc}`../how_to/choose-and-configure-evaluation-tasks`
+- Evaluation concepts: {doc}`../explanation/evaluation-concepts-and-matching`
+- Evaluate CLI reference: {doc}`cli/evaluate`
diff --git a/docs/source/reference/inference-config.md b/docs/source/reference/inference-config.md
new file mode 100644
index 0000000..1aeebbc
--- /dev/null
+++ b/docs/source/reference/inference-config.md
@@ -0,0 +1,41 @@
+# Inference config reference
+
+`InferenceConfig` controls how files are clipped and batched during prediction-time workflows.
+
+Defined in `batdetect2.inference.config`.
+
+## Top-level fields
+
+- `loader`
+  - data-loader settings for inference.
+- `clipping`
+  - controls how recordings are split into clips before batching.
+
+## `loader`
+
+Current built-in loader field:
+
+- `batch_size` (int, default `8`)
+
+## `clipping`
+
+Fields:
+
+- `enabled` (bool)
+- `duration` (float, seconds)
+- `overlap` (float, seconds)
+- `max_empty` (float)
+- `discard_empty` (bool)
+
+## When to override this config
+
+Override `InferenceConfig` when:
+
+- long recordings need different clipping behavior,
+- you want to tune batch size for your hardware,
+- you need reproducible prediction settings across runs.
+
+## Related pages
+
+- Tune inference clipping: {doc}`../how_to/tune-inference-clipping`
+- Predict CLI reference: {doc}`cli/predict`
diff --git a/docs/source/reference/output-formats.md b/docs/source/reference/output-formats.md
new file mode 100644
index 0000000..a4780f1
--- /dev/null
+++ b/docs/source/reference/output-formats.md
@@ -0,0 +1,63 @@
+# Output formats reference
+
+BatDetect2 currently supports several built-in output formatters.
+
+## `raw`
+
+Defined by `RawOutputConfig`.
+
+Best for rich structured outputs and round-tripping (saving predictions and loading them back).
+
+Key fields:
+
+- `include_class_scores`
+- `include_features`
+- `include_geometry`
+
+Writes one NetCDF `.nc` file per clip.
+
+## `parquet`
+
+Defined by `ParquetOutputConfig`.
+
+Best for tabular analysis workflows.
+
+Key fields:
+
+- `include_class_scores`
+- `include_features`
+- `include_geometry`
+
+Writes a parquet table, typically `predictions.parquet`.
+
+## `soundevent`
+
+Defined by `SoundEventOutputConfig`.
+
+Best when your workflow consumes `PredictionSet` JSON files.
+
+Key fields:
+
+- `top_k`
+- `min_score`
+
+Writes a prediction-set JSON file.
+
+## `batdetect2`
+
+Defined by `BatDetect2OutputConfig`.
+
+This is the legacy BatDetect2-style JSON output.
+
+Key fields:
+
+- `event_name`
+- `annotation_note`
+
+Writes one `.json` file per recording.
+
+## Related pages
+
+- Outputs config: {doc}`outputs-config`
+- Save predictions in different output formats: {doc}`../how_to/save-predictions-in-different-output-formats`
+- Understanding formatted outputs: {doc}`../explanation/interpreting-formatted-outputs`
diff --git a/docs/source/reference/output-transforms.md b/docs/source/reference/output-transforms.md
new file mode 100644
index 0000000..b132065
--- /dev/null
+++ b/docs/source/reference/output-transforms.md
@@ -0,0 +1,37 @@
+# Output transforms reference
+
+Output transforms operate after decoding and before formatting.
+
+Defined in `batdetect2.outputs.transforms`.
+
+## Top-level config
+
+`OutputTransformConfig` contains:
+
+- `detection_transforms`
+- `clip_transforms`
+
+## Detection transforms
+
+Detection transforms operate on one detection at a time.
+
+Built-in examples include:
+
+- filtering by frequency,
+- filtering by duration.
+
+A detection that fails one of these filters is removed from the output entirely.
+
+## Clip transforms
+
+Clip transforms operate on the list of detections for one clip.
+
+Built-in examples include:
+
+- removing detections above Nyquist,
+- removing detections at clip edges.
+
+## Related pages
+
+- Outputs config: {doc}`outputs-config`
+- Understanding outputs: {doc}`../explanation/interpreting-formatted-outputs`
diff --git a/docs/source/reference/outputs-config.md b/docs/source/reference/outputs-config.md
new file mode 100644
index 0000000..d548b18
--- /dev/null
+++ b/docs/source/reference/outputs-config.md
@@ -0,0 +1,33 @@
+# Outputs config reference
+
+`OutputsConfig` controls two layers of prediction handling:
+
+- how detections are transformed before formatting,
+- how formatted outputs are written to disk.
+
+Defined in `batdetect2.outputs.config`.
+
+## Fields
+
+- `format`
+  - output format config.
+- `transform`
+  - output transform config.
+
+## Mental model
+
+The output workflow is:
+
+1. model outputs are decoded into detections,
+2. optional output transforms filter or adjust those detections,
+3. a formatter serializes them to disk.
+
+## Default behavior
+
+By default, the current stack uses the `raw` output formatter.
+
+## Related pages
+
+- Output formats: {doc}`output-formats`
+- Output transforms: {doc}`output-transforms`
+- Save predictions in different output formats: {doc}`../how_to/save-predictions-in-different-output-formats`