batdetect2/docs/source/tutorials/evaluate-on-a-test-set.md
2026-04-30 11:48:11 +01:00

# Tutorial: Evaluate on a test set
This tutorial shows how to evaluate a trained checkpoint on a held-out dataset
and inspect the output metrics.

It is for advanced users who want to measure a trained model's performance on a separate test dataset.
## Before you start
- A trained model checkpoint.
- A test dataset config file.
- (Optional) Targets, audio, inference, and evaluation config overrides.
```{note}
This page is for model evaluation.
If you only want to run BatDetect2 on recordings,
start with {doc}`run-inference-on-folder` instead.
```
## Outcome
By the end of this tutorial you will have:
- run `batdetect2 evaluate`,
- written evaluation metrics and result files,
- understood what to inspect first,
- identified the next pages for evaluation concepts and configuration.
## 1. Start with a held-out dataset
Use a dataset that was not used for training or tuning.
A held-out dataset is simply a separate dataset kept aside for evaluation.
If you tune thresholds or configs on the same dataset that you report as the final evaluation, the reported metrics will be optimistically biased.
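
One quick sanity check is to confirm that no recording appears in both the training and test sets. The sketch below assumes you can export each set as a plain-text list with one recording path or ID per line. The helper name and the list file names are hypothetical and not part of BatDetect2.

```bash
# Hypothetical helper: print recordings that appear in both file lists.
# Each argument is a plain-text file with one recording path or ID per line.
overlapping_recordings() {
    train_sorted=$(mktemp)
    test_sorted=$(mktemp)
    sort -u "$1" > "$train_sorted"
    sort -u "$2" > "$test_sorted"
    # comm -12 keeps only the lines present in both sorted inputs.
    comm -12 "$train_sorted" "$test_sorted"
    rm -f "$train_sorted" "$test_sorted"
}

# Example usage with hypothetical list files:
# overlapping_recordings train_recordings.txt test_recordings.txt
```

An empty result means the two lists share no entries. Any printed line is a recording that should be removed from one of the sets before you report results.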
## 2. Run evaluation
```bash
batdetect2 evaluate \
  path/to/model.ckpt \
  path/to/test_dataset.yaml \
  --base-dir path/to/project_root \
  --output-dir path/to/eval_outputs
```
This command loads the checkpoint,
runs prediction on the test dataset,
applies the chosen evaluation tasks,
and writes metrics and result files to the output directory.
Use `--base-dir` whenever the dataset config contains relative paths.
That is the common case for project-local dataset files.
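
If a run fails because files cannot be found, it can help to check the relative paths by hand. The sketch below assumes that relative paths in the dataset config are resolved against `--base-dir`, and that you copy those relative paths out of the config yourself. The helper name is hypothetical.

```bash
# Hypothetical helper: report relative paths that do not exist under a base directory.
# Usage: missing_under_base_dir <base-dir> <relative-path>...
missing_under_base_dir() {
    base_dir=$1
    shift
    result=0
    for rel_path in "$@"; do
        if [ ! -e "$base_dir/$rel_path" ]; then
            echo "missing: $base_dir/$rel_path" >&2
            result=1
        fi
    done
    # Return non-zero if at least one path was missing.
    return "$result"
}

# Example usage with hypothetical relative paths taken from a dataset config:
# missing_under_base_dir path/to/project_root data/audio data/annotations
```

The function prints each missing path to standard error and returns a non-zero status, so it can gate the evaluation command with `&&`.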
## 3. Inspect the output directory
Look for:
- summary metrics,
- generated plots,
- saved prediction files if they were enabled,
- enough metadata to reproduce the run later.
The exact set depends on the configured evaluation tasks and plots.
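
Because the contents vary between runs, a generic listing is a reasonable first look. The sketch below uses only standard shell tools and makes no assumption about BatDetect2's file names or formats. Both helper names are hypothetical.

```bash
# Hypothetical helper: list every regular file under the output directory.
list_eval_outputs() {
    find "$1" -type f | sort
}

# Hypothetical helper: count files per extension to see what kinds of
# artifacts the run produced (files without an extension are skipped).
count_by_extension() {
    find "$1" -type f -name '*.*' | sed 's/.*\.//' | sort | uniq -c
}

# Example usage:
# list_eval_outputs path/to/eval_outputs
# count_by_extension path/to/eval_outputs
```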
## 4. Interpret the results in context
Do not reduce evaluation to a single number.
Check:
- which task the metric belongs to,
- which thresholding or matching assumptions were used,
- whether class-level behavior matches your use case,
- whether the failures are concentrated in specific taxa, sites, or recording conditions.
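
To see why a single number can mislead, it may help to compute precision and recall per class from raw counts. The sketch below assumes a hypothetical CSV with a header row and the columns `class,tp,fp,fn`. This layout is an illustration only and is not the format BatDetect2 writes.

```bash
# Hypothetical helper: print precision and recall for each class.
# Input: CSV with a header row and columns class,tp,fp,fn (illustrative layout).
per_class_precision_recall() {
    awk -F, 'NR > 1 {
        # Report 0 when a ratio is undefined (no predictions or no ground truth).
        precision = ($2 + $3 > 0) ? $2 / ($2 + $3) : 0
        recall = ($2 + $4 > 0) ? $2 / ($2 + $4) : 0
        printf "%s precision=%.2f recall=%.2f\n", $1, precision, recall
    }' "$1"
}

# Example usage with a hypothetical counts file:
# per_class_precision_recall class_counts.csv
```

A class with few predictions can show perfect precision while missing most of its calls, which an aggregate score would hide.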
## 5. Record the evaluation setup
Keep the command, config files, checkpoint path, and dataset version together.
That matters for reproducibility and for later model comparisons.
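
One lightweight way to do this is to write a small text record next to the outputs. The sketch below is a hypothetical helper, not a BatDetect2 feature. It assumes `sha256sum` from GNU coreutils is available (on macOS, `shasum -a 256` is the usual substitute).

```bash
# Hypothetical helper: save the key facts about an evaluation run to a text file.
# Usage: record_eval_setup <record-file> <checkpoint> <dataset-config> <command-string>
record_eval_setup() {
    {
        echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "command: $4"
        echo "checkpoint: $2"
        # The checksum identifies the exact checkpoint even if the file is renamed.
        echo "checkpoint_sha256: $(sha256sum "$2" | cut -d' ' -f1)"
        echo "dataset_config: $3"
    } > "$1"
}

# Example usage with the placeholder paths from this tutorial:
# record_eval_setup path/to/eval_outputs/setup.txt \
#     path/to/model.ckpt path/to/test_dataset.yaml \
#     "batdetect2 evaluate path/to/model.ckpt path/to/test_dataset.yaml"
```

Recording a dataset version is left out here because how it is tracked depends on your project.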
## What to do next
- Compare thresholds on representative files: {doc}`../how_to/tune-detection-threshold`
- Configure evaluation tasks: {doc}`../how_to/choose-and-configure-evaluation-tasks`
- Interpret evaluation artifacts: {doc}`../how_to/interpret-evaluation-outputs`
- Learn the evaluation concepts: {doc}`../explanation/evaluation-concepts-and-matching`
- Check full evaluate options: {doc}`../reference/cli/evaluate`