mirror of
https://github.com/macaodha/batdetect2.git
synced 2026-05-23 06:41:53 +02:00
# Tutorial: Evaluate on a test set

This tutorial shows how to evaluate a trained checkpoint on a held-out dataset
and inspect the output metrics.

It is for advanced users who want to measure how one trained model performs on a separate test dataset.

## Before you start

- A trained model checkpoint.
- A test dataset config file.
- (Optional) Targets, audio, inference, and evaluation config overrides.

```{note}
This page is for model evaluation.
If you only want to run BatDetect2 on recordings,
start with {doc}`run-inference-on-folder` instead.
```

## Outcome

By the end of this tutorial you will have:

- run `batdetect2 evaluate`,
- written evaluation metrics and result files,
- understood what to inspect first,
- identified the next pages for evaluation concepts and configuration.

## 1. Start with a held-out dataset

Use a dataset that was not used for training or tuning.

A held-out dataset is simply a separate dataset kept aside for evaluation.

If you tune thresholds or configs on the same dataset that you report as your final evaluation, the reported results will be optimistically biased.

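One quick sanity check before evaluating is to confirm that no recording appears in both splits. The sketch below is not part of BatDetect2. It assumes you can list the recording paths of each split yourself, and the helper name `find_shared_recordings` and the example paths are hypothetical. It compares file names only, so two different recordings that happen to share a name would be flagged as well.

```python
from pathlib import Path


def find_shared_recordings(train_paths, test_paths):
    """Return the sorted file names that occur in both lists of paths."""
    train_names = {Path(p).name for p in train_paths}
    test_names = {Path(p).name for p in test_paths}
    return sorted(train_names & test_names)


# Hypothetical recording lists, for illustration only.
train = ["data/train/site_a_001.wav", "data/train/site_a_002.wav"]
test = ["data/test/site_b_001.wav", "data/test/site_a_002.wav"]

shared = find_shared_recordings(train, test)
if shared:
    print("Recordings present in both splits:", shared)
```

An empty result does not prove the split is clean (recordings from the same site or night can still leak information), but a non-empty one is a clear sign that the test set is not held out.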
## 2. Run evaluation

```bash
batdetect2 evaluate \
  path/to/model.ckpt \
  path/to/test_dataset.yaml \
  --base-dir path/to/project_root \
  --output-dir path/to/eval_outputs
```

This command loads the checkpoint,
runs prediction on the test dataset,
applies the chosen evaluation tasks,
and writes metrics and result files to the output directory.

Use `--base-dir` whenever the dataset config contains relative paths.

That is the common case for project-local dataset files.

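Because relative paths only make sense in combination with a base directory, a missing or wrong `--base-dir` tends to show up as files that cannot be found. The following pre-flight check is a minimal sketch, not BatDetect2 code. It assumes that relative paths are resolved by joining them onto the base directory, which is the usual convention but is not confirmed here for BatDetect2. The helper `unresolved_paths` and the example path names are hypothetical.

```python
from pathlib import Path


def unresolved_paths(base_dir, relative_paths):
    """Return the relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [str(rel) for rel in relative_paths if not (base / rel).exists()]


# Hypothetical values: in practice these would come from your dataset config.
missing = unresolved_paths("path/to/project_root", ["audio", "annotations"])
if missing:
    print("Not found under the base directory:", missing)
```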
## 3. Inspect the output directory

Look for:

- summary metrics,
- generated plots,
- saved prediction files if they were enabled,
- enough metadata to reproduce the run later.

The exact set depends on the configured evaluation tasks and plots.

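Since the exact file set varies, it can help to get an overview of what was actually written before opening individual files. The sketch below makes no assumption about BatDetect2's output layout. It walks a directory and groups whatever it finds by file suffix. The helper name `group_files_by_suffix` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_files_by_suffix(output_dir):
    """Map each file suffix to the sorted relative paths that carry it."""
    root = Path(output_dir)
    groups = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            key = path.suffix or "(no suffix)"
            groups[key].append(path.relative_to(root).as_posix())
    return {suffix: sorted(files) for suffix, files in sorted(groups.items())}


# Directory name taken from the command above; it is skipped if it does not exist.
output_dir = Path("path/to/eval_outputs")
if output_dir.is_dir():
    for suffix, files in group_files_by_suffix(output_dir).items():
        print(suffix, files)
```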
## 4. Interpret the results in context

Do not reduce evaluation to a single number.

Check:

- which task the metric belongs to,
- which thresholding or matching assumptions were used,
- whether class-level behavior matches your use case,
- whether the failures are concentrated in specific taxa, sites, or recording conditions.

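To see why a single number can mislead, consider recall computed per class next to recall pooled over all classes. The sketch below uses made-up counts and placeholder class names. It does not read BatDetect2 output, and `recall_by_class` is a hypothetical helper.

```python
def recall_by_class(counts):
    """Compute recall per class from (true positives, false negatives) counts."""
    recalls = {}
    for name, (tp, fn) in counts.items():
        total = tp + fn
        recalls[name] = tp / total if total else float("nan")
    return recalls


# Made-up counts for illustration only: (true positives, false negatives).
counts = {"class_a": (90, 10), "class_b": (45, 5), "class_c": (2, 8)}

per_class = recall_by_class(counts)
pooled = sum(tp for tp, _ in counts.values()) / sum(
    tp + fn for tp, fn in counts.values()
)

print(f"pooled recall: {pooled:.2f}")
for name, value in per_class.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts the pooled recall is about 0.86 even though `class_c` reaches only 0.20, which is the kind of gap a single summary number hides.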
## 5. Record the evaluation setup

Keep the command, config files, checkpoint path, and dataset version together.

That matters for reproducibility and for later model comparisons.

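One lightweight way to keep these items together is a small manifest file stored next to the evaluation outputs. The sketch below is an assumption about how you might do this and not a BatDetect2 feature. The manifest fields, the helper names, and the `"v1"` dataset label are hypothetical, and the command list simply repeats the invocation from step 2. A checkpoint hash is added so that a later comparison can confirm the same file was used.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(command, checkpoint, config_files, dataset_version):
    """Collect the evaluation setup in a JSON-serialisable dictionary."""
    checkpoint = Path(checkpoint)
    return {
        "command": list(command),
        "checkpoint": str(checkpoint),
        # The hash is only computed when the checkpoint file is present.
        "checkpoint_sha256": file_sha256(checkpoint) if checkpoint.is_file() else None,
        "config_files": [str(p) for p in config_files],
        "dataset_version": dataset_version,
    }


command = [
    "batdetect2", "evaluate",
    "path/to/model.ckpt", "path/to/test_dataset.yaml",
    "--base-dir", "path/to/project_root",
    "--output-dir", "path/to/eval_outputs",
]
manifest = build_manifest(
    command,
    checkpoint="path/to/model.ckpt",
    config_files=["path/to/test_dataset.yaml"],
    dataset_version="v1",  # hypothetical label; use whatever identifies your data
)
print(json.dumps(manifest, indent=2))
```

Writing the printed JSON to a file in the output directory keeps the record alongside the metrics it describes.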
## What to do next

- Compare thresholds on representative files: {doc}`../how_to/tune-detection-threshold`
- Configure evaluation tasks: {doc}`../how_to/choose-and-configure-evaluation-tasks`
- Interpret evaluation artifacts: {doc}`../how_to/interpret-evaluation-outputs`
- Learn the evaluation concepts: {doc}`../explanation/evaluation-concepts-and-matching`
- Check full evaluate options: {doc}`../reference/cli/evaluate`