batdetect2/docs/source/tutorials/evaluate-on-a-test-set.md
2026-05-06 17:22:18 +01:00

2.7 KiB

Tutorial: Evaluate on a test set

This tutorial shows how to evaluate a trained checkpoint on a held-out dataset and inspect the output metrics.

This tutorial is for advanced users who want to compare one trained model against a separate test dataset.

Before you start

  • A trained model checkpoint.
  • A test dataset config file.
  • (Optional) Targets, audio, inference, and evaluation config overrides.
This page is for model evaluation.
If you only want to run BatDetect2 on recordings,
start with {doc}`run-inference-on-folder` instead.

Outcome

By the end of this tutorial you will have:

  • run batdetect2 evaluate,
  • written evaluation metrics and result files,
  • understood what to inspect first,
  • identified the next pages for evaluation concepts and configuration.

1. Start with a held-out dataset

Use a dataset that was not used for training or tuning.

A held-out dataset is simply a separate dataset kept aside for evaluation.

If you tune thresholds or configs on the same dataset that you report as final evaluation, the results will be optimistic.

2. Run evaluation

batdetect2 evaluate \
  path/to/test_dataset.yaml \
  --model path/to/model.ckpt \
  --base-dir path/to/project_root \
  --output-dir path/to/eval_outputs

This command loads the checkpoint, runs prediction on the test dataset, applies the chosen evaluation tasks, and writes metrics and result files to the output directory.

Use --base-dir whenever the dataset config contains relative paths.

That is the common case for project-local dataset files.

3. Inspect the output directory

Look for:

  • summary metrics,
  • generated plots,
  • saved prediction files if they were enabled,
  • enough metadata to reproduce the run later.

The exact set depends on the configured evaluation tasks and plots.

4. Interpret the results in context

Do not reduce evaluation to a single number.

Check:

  • which task the metric belongs to,
  • which thresholding or matching assumptions were used,
  • whether class-level behavior matches your use case,
  • whether the failures are concentrated in specific taxa, sites, or recording conditions.

5. Record the evaluation setup

Keep the command, config files, checkpoint path, and dataset version together.

That matters for reproducibility and for later model comparisons.

What to do next

  • Compare thresholds on representative files: {doc}../how_to/tune-detection-threshold
  • Configure evaluation tasks: {doc}../how_to/choose-and-configure-evaluation-tasks
  • Interpret evaluation artifacts: {doc}../how_to/interpret-evaluation-outputs
  • Learn the evaluation concepts: {doc}../explanation/evaluation-concepts-and-matching
  • Check full evaluate options: {doc}../reference/cli/evaluate