mirror of
https://github.com/macaodha/batdetect2.git
synced 2026-05-22 22:32:18 +02:00
docs: refresh evaluation tutorial outputs and setup
parent 5760b6e017
commit 7a0f5c4b5a
@@ -1,92 +1,133 @@
# Tutorial: Evaluate on a test set
# Evaluate on a test set

This tutorial shows how to evaluate a trained checkpoint on a held-out dataset
and inspect the output metrics.

This tutorial is for advanced users who want to compare one trained model
against a separate test dataset.
Use it when you want to measure how a model performs on labelled data that was
kept aside for testing.

## Before you start

- A trained model checkpoint.
- A test dataset config file.
- (Optional) Targets, audio, inference, and evaluation config overrides.
You need:

- a test dataset config,
- a trained checkpoint or model alias.

```{note}
This page is for model evaluation.
If you only want to run BatDetect2 on recordings,
start with {doc}`run-inference-on-folder` instead.
If you only want to run BatDetect2 on recordings, start with
{doc}`run-inference-on-folder` instead.
```

## Outcome
## What you will do

By the end of this tutorial you will have:

- prepared a test dataset config,
- run `batdetect2 evaluate`,
- written evaluation metrics and result files,
- understood what to inspect first,
- identified the next pages for evaluation concepts and configuration.
- identified the next pages for model choice and evaluation configuration.

## 1. Start with a held-out dataset
## 1. Create a test dataset config

Evaluation needs a dataset config that points to the labelled data you want to
use for testing.

This is the same kind of dataset config used for training.
It explicitly declares which data sources BatDetect2 should read, including the
audio files and their annotations.

For an example, see `example_data/dataset.yaml`.

If you need help creating the dataset config, follow the dataset section in
{doc}`train-a-custom-model`.
For more detail on dataset source formats, see {doc}`../reference/data-sources`.

Use a dataset that was not used for training or tuning.

A held-out dataset is simply a separate dataset kept aside for evaluation.

If you tune thresholds or configs on the same dataset that you report as final
evaluation, the results will be optimistic.
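
One way to check this before running evaluation is to compare the recordings behind each split. The sketch below is not part of BatDetect2: `find_overlap` and the example paths are hypothetical, and it assumes that you can list the audio files used by each dataset config and that a shared file name means a shared recording.

```python
from pathlib import Path


def find_overlap(train_files, test_files):
    """Return the recording names that appear in both splits, sorted."""
    # Compare by file name only, so the same recording stored under two
    # different folders is still reported (an assumption of this sketch).
    train_names = {Path(p).name for p in train_files}
    test_names = {Path(p).name for p in test_files}
    return sorted(train_names & test_names)


# Hypothetical file lists, for illustration only.
train_files = ["data/train/rec_001.wav", "data/train/rec_002.wav"]
test_files = ["data/test/rec_002.wav", "data/test/rec_003.wav"]

overlap = find_overlap(train_files, test_files)
if overlap:
    print("Recordings present in both splits:", overlap)
```

An empty result does not prove the splits are independent (recordings from the same site or night can still be correlated), but a non-empty one is a clear sign that the test set is not held out.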

## 2. Run evaluation

For a simple run, use:

```bash
batdetect2 evaluate \
  path/to/test_dataset.yaml
```

If you do not pass `--model`, BatDetect2 uses the built-in default UK model.
If you want to choose a different checkpoint, alias, or Hugging Face model, see
{doc}`../how_to/choose-a-model`.

If you want to save the results somewhere else, add `--output-dir`:

```bash
batdetect2 evaluate \
  path/to/test_dataset.yaml \
  --model path/to/model.ckpt \
  --base-dir path/to/project_root \
  --output-dir path/to/eval_outputs
```

This command loads the checkpoint, runs prediction on the test dataset, applies
the chosen evaluation tasks, and writes metrics and result files to the output
directory.
This command loads the model, runs prediction on the test dataset, applies the
evaluation tasks, and writes the results to the output directory.
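
If you launch evaluation from a script, the same command can be assembled in Python. This is a minimal sketch rather than a BatDetect2 API: `build_evaluate_command` and `run_evaluation` are hypothetical helpers that use only the `--model`, `--base-dir`, and `--output-dir` options shown on this page, and they assume the `batdetect2` executable is on your `PATH`.

```python
import subprocess


def build_evaluate_command(dataset_config, model=None, base_dir=None, output_dir=None):
    """Build the argument list for `batdetect2 evaluate` (hypothetical helper)."""
    cmd = ["batdetect2", "evaluate", str(dataset_config)]
    # Add each option only when a value was given, so the CLI defaults apply otherwise.
    if model is not None:
        cmd += ["--model", str(model)]
    if base_dir is not None:
        cmd += ["--base-dir", str(base_dir)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    return cmd


def run_evaluation(dataset_config, **options):
    """Run the command and raise CalledProcessError if it exits with an error."""
    cmd = build_evaluate_command(dataset_config, **options)
    return subprocess.run(cmd, check=True)


# Only build and show the command here; call run_evaluation(...) to execute it.
command = build_evaluate_command(
    "path/to/test_dataset.yaml",
    model="path/to/model.ckpt",
    output_dir="path/to/eval_outputs",
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.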

Use `--base-dir` whenever the dataset config contains relative paths.
## 3. Check the output files

That is the common case for project-local dataset files.
By default, the CLI writes evaluation outputs to `outputs/evaluation`.

## 3. Inspect the output directory
With the default evaluation config, a run will usually create a folder like
this:

Look for:
```text
outputs/evaluation/
  version_0/
    metrics.csv
    hparams.yaml
```

- summary metrics,
- generated plots,
- saved prediction files if they were enabled,
- enough metadata to reproduce the run later.
The most important file is `metrics.csv`.
It contains the metric values computed for the evaluation run.

The exact set depends on the configured evaluation tasks and plots.
A file like this might start like:

## 4. Interpret the results in context
```csv
classification/average_precision/barbar,classification/average_precision/cneser,...,detection/average_precision
0.898695170879364,0.9408193826675415,...,0.851219117641449
```

Do not reduce evaluation to a single number.
The exact columns depend on the evaluation tasks you run.
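
Because the column names in the example above are slash-separated, the values can be grouped by their leading segment when read in Python. This is a minimal sketch, assuming the first segment names the evaluation task and that the last row holds the values of interest. `group_metrics` is a hypothetical helper, not part of BatDetect2, and the numbers below are made up.

```python
import csv
import io
from collections import defaultdict


def group_metrics(stream):
    """Group the values of the last CSV row by the text before the first '/'."""
    rows = list(csv.DictReader(stream))
    if not rows:
        raise ValueError("metrics file has no data rows")
    grouped = defaultdict(dict)
    for column, value in rows[-1].items():
        if column is None:
            # Extra cells beyond the header have no column name; ignore them.
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            # Skip empty or non-numeric cells instead of failing.
            continue
        task, _, name = column.partition("/")
        grouped[task][name] = number
    return dict(grouped)


# Made-up values for illustration. For a real run, pass an open file instead,
# for example: open("outputs/evaluation/version_0/metrics.csv", newline="").
example = io.StringIO(
    "classification/average_precision/barbar,detection/average_precision\n"
    "0.9,0.8\n"
)
metrics = group_metrics(example)
for task, values in metrics.items():
    print(task, values)
```

Grouping by task keeps per-class classification scores separate from the overall detection score, which makes it harder to mistake one for the other.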

Check:
The `hparams.yaml` file records the config used for the evaluation run.

- which task the metric belongs to,
- which thresholding or matching assumptions were used,
- whether class-level behavior matches your use case,
- whether the failures are concentrated in specific taxa, sites, or recording
  conditions.
## 4. Expect extra plots and files when configs enable them

## 5. Record the evaluation setup
You may also see extra outputs such as plots and saved predictions.

Keep the command, config files, checkpoint path, and dataset version together.
For example, if you run evaluation with `example_data/configs/evaluation.yaml`,
you should expect a richer output folder with:

That matters for reproducibility and for later model comparisons.
- `metrics.csv`
- `hparams.yaml`
- a `plots/` directory
- a `predictions/` directory

## What to do next
That config enables more evaluation tasks and plots than the default setup.

- Compare thresholds on representative files:
  {doc}`../how_to/tune-detection-threshold`
So, depending on your evaluation config, you may see files such as:

- precision-recall plots,
- ROC curves,
- confusion matrices,
- example detection plots,
- saved prediction files.

If you want to control which tasks run and which plots are generated, see
{doc}`../reference/evaluation-config` and
{doc}`../how_to/choose-and-configure-evaluation-tasks`.

## Common next steps

- Choose a different model:
  {doc}`../how_to/choose-a-model`
- Configure evaluation tasks:
  {doc}`../how_to/choose-and-configure-evaluation-tasks`
- Interpret evaluation artifacts: