docs: refresh evaluation tutorial outputs and setup

2026-05-22 22:32:18 +02:00 · 2026-05-07 08:24:08 +01:00 · 2026-05-07 08:24:08 +01:00 · 7a0f5c4b5a
commit 7a0f5c4b5a
parent 5760b6e017
1 changed file with 85 additions and 44 deletions
--- a/docs/source/tutorials/evaluate-on-a-test-set.md
+++ b/docs/source/tutorials/evaluate-on-a-test-set.md
@@ -1,92 +1,133 @@
-# Tutorial: Evaluate on a test set
+# Evaluate on a test set
 This tutorial shows how to evaluate a trained checkpoint on a held-out dataset
 and inspect the output metrics.
-This tutorial is for advanced users who want to compare one trained model
+Use it when you want to measure how a model performs on labelled data that was
-against a separate test dataset.
+kept aside for testing.
 ## Before you start
- A trained model checkpoint.
+You need:
- A test dataset config file.
+
- (Optional) Targets, audio, inference, and evaluation config overrides.
+- a test dataset config,
 - a trained checkpoint or model alias.
 ```{note}
 This page is for model evaluation.
-If you only want to run BatDetect2 on recordings,
+If you only want to run BatDetect2 on recordings, start with
-start with {doc}`run-inference-on-folder` instead.
+{doc}`run-inference-on-folder` instead.
 ```
-## Outcome
+## What you will do
 By the end of this tutorial you will have:
 - prepared a test dataset config,
 - run `batdetect2 evaluate`,
 - written evaluation metrics and result files,
- understood what to inspect first,
+- identified the next pages for model choice and evaluation configuration.
 - identified the next pages for evaluation concepts and configuration.
-## 1. Start with a held-out dataset
+## 1. Create a test dataset config
 Evaluation needs a dataset config that points to the labelled data you want to
 use for testing.
 This is the same kind of dataset config used for training.
 It explicitly declares which data sources BatDetect2 should read, including the
 audio files and their annotations.
 For an example, see `example_data/dataset.yaml`.
 If you need help creating the dataset config, follow the dataset section in
 {doc}`train-a-custom-model`.
 For more detail on dataset source formats, see {doc}`../reference/data-sources`.
 Use a dataset that was not used for training or tuning.
 A held-out dataset is simply a separate dataset kept aside for evaluation.
 If you tune thresholds or configs on the same dataset that you report as final
 evaluation, the results will be optimistic.
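
One rough way to check that a test set is really held out is to compare the recording lists behind the training and test dataset configs. The sketch below assumes you have exported those lists yourself as plain-text files with one recording filename per line. The helper name `list_overlap` and the file names in the usage line are hypothetical and not part of BatDetect2.

```bash
# Hypothetical helper, not part of BatDetect2.
# Print every recording name that appears in both lists.
# $1: training list, $2: test list (one filename per line in each).
list_overlap() {
  # -F: fixed strings, -x: match whole lines, -f: read patterns from $1.
  grep -Fxf "$1" "$2" | sort -u
}
```

For example, `list_overlap train_files.txt test_files.txt` prints nothing when the two lists share no filenames. This only compares names, so it will not catch the same recording stored under two different filenames.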
 ## 2. Run evaluation
 For a simple run, use:
 ```bash
 batdetect2 evaluate \
  path/to/test_dataset.yaml
 ```
 If you do not pass `--model`, BatDetect2 uses the built-in default UK model.
 If you want to choose a different checkpoint, alias, or Hugging Face model, see
 {doc}`../how_to/choose-a-model`.
 If you want to save the results somewhere else, add `--output-dir`:
 ```bash
 batdetect2 evaluate \
  path/to/test_dataset.yaml \
  --model path/to/model.ckpt \
  --base-dir path/to/project_root \
  --output-dir path/to/eval_outputs
 ```
-This command loads the checkpoint, runs prediction on the test dataset, applies
+This command loads the model, runs prediction on the test dataset, applies the
-the chosen evaluation tasks, and writes metrics and result files to the output
+evaluation tasks, and writes the results to the output directory.
 directory.
-Use `--base-dir` whenever the dataset config contains relative paths.
+## 3. Check the output files
-That is the common case for project-local dataset files.
+By default, the CLI writes evaluation outputs to `outputs/evaluation`.
-## 3. Inspect the output directory
+With the default evaluation config, a run will usually create a folder like
 this:
-Look for:
+```text
 outputs/evaluation/
  version_0/
    metrics.csv
    hparams.yaml
 ```
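
If you evaluate more than once, it can help to locate the most recent run folder before opening its files. The sketch below assumes that repeated runs create `version_0`, `version_1`, and so on under the output root. That numbering scheme is an assumption here, and `latest_eval_run` is a hypothetical helper rather than a BatDetect2 command.

```bash
# Hypothetical helper, not part of BatDetect2.
# Print the version_N subdirectory with the highest N under an output root.
# Assumes runs are numbered version_0, version_1, ... (an assumption).
latest_eval_run() {
  root="${1:-outputs/evaluation}"
  best=-1
  for dir in "$root"/version_*; do
    [ -d "$dir" ] || continue
    n="${dir##*version_}"
    # Skip directories whose suffix is not a plain non-negative integer.
    case "$n" in ''|*[!0-9]*) continue ;; esac
    if [ "$n" -gt "$best" ]; then best="$n"; fi
  done
  # Print nothing (and return non-zero) when no run folder was found.
  [ "$best" -ge 0 ] && printf '%s\n' "$root/version_$best"
}
```

A possible use is `run_dir=$(latest_eval_run path/to/eval_outputs) && ls "$run_dir"`, which lists the files written by the latest run.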
- summary metrics,
+The most important file is `metrics.csv`.
- generated plots,
+It contains the metric values computed for the evaluation run.
 - saved prediction files if they were enabled,
 - enough metadata to reproduce the run later.
-The exact set depends on the configured evaluation tasks and plots.
+A file like this might start like:
-## 4. Interpret the results in context
+```csv
 classification/average_precision/barbar,classification/average_precision/cneser,...,detection/average_precision
 0.898695170879364,0.9408193826675415,...,0.851219117641449
 ```
-Do not reduce evaluation to a single number.
+The exact columns depend on the evaluation tasks you run.
-Check:
+The `hparams.yaml` file records the config used for the evaluation run.
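
Because `metrics.csv` stores one column per metric, a wide file can be hard to read in a terminal. As a minimal sketch, assuming the file has a header row followed by a single data row and no quoted commas, the hypothetical helper `show_metrics` below prints one metric per line.

```bash
# Hypothetical helper, not part of BatDetect2.
# Print each column of a one-row metrics CSV as "name<TAB>value".
# Assumes a header row, then one data row, with no quoted commas.
show_metrics() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
    NR == 2 { for (i = 1; i <= NF; i++) printf "%s\t%s\n", name[i], $i }
  ' "$1"
}
```

For example, `show_metrics outputs/evaluation/version_0/metrics.csv` would list each metric name next to its value. If your file contains more than one data row, only the first is shown.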
- which task the metric belongs to,
+## 4. Expect extra plots and files when configs enable them
 - which thresholding or matching assumptions were used,
 - whether class-level behavior matches your use case,
 - whether the failures are concentrated in specific taxa, sites, or recording
  conditions.
-## 5. Record the evaluation setup
+You may also see extra outputs such as plots and saved predictions.
-Keep the command, config files, checkpoint path, and dataset version together.
+For example, if you run evaluation with `example_data/configs/evaluation.yaml`,
 you should expect a richer output folder with:
-That matters for reproducibility and for later model comparisons.
+- `metrics.csv`
 - `hparams.yaml`
 - a `plots/` directory
 - a `predictions/` directory
-## What to do next
+That config enables more evaluation tasks and plots than the default setup.
- Compare thresholds on representative files:
+So, depending on your evaluation config, you may see files such as:
-  {doc}`../how_to/tune-detection-threshold`
+
 - precision-recall plots,
 - ROC curves,
 - confusion matrices,
 - example detection plots,
 - saved prediction files.
 If you want to control which tasks run and which plots are generated, see
 {doc}`../reference/evaluation-config` and
 {doc}`../how_to/choose-and-configure-evaluation-tasks`.
 ## Common next steps
 - Choose a different model:
  {doc}`../how_to/choose-a-model`
 - Configure evaluation tasks:
  {doc}`../how_to/choose-and-configure-evaluation-tasks`
 - Interpret evaluation artifacts: