mirror of
https://github.com/macaodha/batdetect2.git
synced 2026-05-22 22:32:18 +02:00
docs: refresh evaluation tutorial outputs and setup
parent 5760b6e017
commit 7a0f5c4b5a
@@ -1,92 +1,133 @@
|
|||||||
# Tutorial: Evaluate on a test set
|
# Evaluate on a test set
|
||||||
|
|
||||||
This tutorial shows how to evaluate a trained checkpoint on a held-out dataset
|
This tutorial shows how to evaluate a trained checkpoint on a held-out dataset
|
||||||
and inspect the output metrics.
|
and inspect the output metrics.
|
||||||
|
|
||||||
This tutorial is for advanced users who want to compare one trained model
|
Use it when you want to measure how a model performs on labelled data that was
|
||||||
against a separate test dataset.
|
kept aside for testing.
|
||||||
|
|
||||||
## Before you start
|
## Before you start
|
||||||
|
|
||||||
- A trained model checkpoint.
|
You need:
|
||||||
- A test dataset config file.
|
|
||||||
- (Optional) Targets, audio, inference, and evaluation config overrides.
|
- a test dataset config,
|
||||||
|
- a trained checkpoint or model alias.
|
||||||
|
|
||||||
```{note}
|
```{note}
|
||||||
This page is for model evaluation.
|
This page is for model evaluation.
|
||||||
If you only want to run BatDetect2 on recordings,
|
If you only want to run BatDetect2 on recordings, start with
|
||||||
start with {doc}`run-inference-on-folder` instead.
|
{doc}`run-inference-on-folder` instead.
|
||||||
```
|
```
|
||||||
|
|
||||||
## Outcome
|
## What you will do
|
||||||
|
|
||||||
By the end of this tutorial you will have:
|
By the end of this tutorial you will have:
|
||||||
|
|
||||||
|
- prepared a test dataset config,
|
||||||
- run `batdetect2 evaluate`,
|
- run `batdetect2 evaluate`,
|
||||||
- written evaluation metrics and result files,
|
- written evaluation metrics and result files,
|
||||||
- understood what to inspect first,
|
- identified the next pages for model choice and evaluation configuration.
|
||||||
- identified the next pages for evaluation concepts and configuration.
|
|
||||||
|
|
||||||
## 1. Start with a held-out dataset
|
## 1. Create a test dataset config
|
||||||
|
|
||||||
|
Evaluation needs a dataset config that points to the labelled data you want to
|
||||||
|
use for testing.
|
||||||
|
|
||||||
|
This is the same kind of dataset config used for training.
|
||||||
|
It explicitly declares which data sources BatDetect2 should read, including the
|
||||||
|
audio files and their annotations.
|
||||||
|
|
||||||
|
For an example, see `example_data/dataset.yaml`.
|
||||||
|
|
||||||
|
If you need help creating the dataset config, follow the dataset section in
|
||||||
|
{doc}`train-a-custom-model`.
|
||||||
|
For more detail on dataset source formats, see {doc}`../reference/data-sources`.
|
||||||
|
|
||||||
Use a dataset that was not used for training or tuning.
|
Use a dataset that was not used for training or tuning.
|
||||||
|
|
||||||
A held-out dataset is simply a separate dataset kept aside for evaluation.
|
|
||||||
|
|
||||||
If you tune thresholds or configs on the same dataset that you report as final
|
|
||||||
evaluation, the results will be optimistic.
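
One way to keep a held-out set stable over time is to assign each recording to a split deterministically, so a file never drifts from test into training between runs. The sketch below is not part of BatDetect2; the function name `assign_split`, the recording identifiers, and the 20% test fraction are all hypothetical choices made for illustration.

```python
import hashlib


def assign_split(recording_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a recording to 'train' or 'test'.

    The assignment depends only on the identifier, so re-running the
    split never moves a recording from one side to the other.
    """
    digest = hashlib.sha256(recording_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Hypothetical recording identifiers, used only to show the call.
recordings = [f"site_a/rec_{i:03d}.wav" for i in range(10)]
splits = {name: assign_split(name) for name in recordings}

for name, split in splits.items():
    print(f"{split:5s} {name}")
```

Splitting by recording (or by site or night) rather than by individual call helps avoid near-duplicate calls from the same recording landing on both sides of the split.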
|
|
||||||
|
|
||||||
## 2. Run evaluation
|
## 2. Run evaluation
|
||||||
|
|
||||||
|
For a simple run, use:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
batdetect2 evaluate \
|
||||||
|
path/to/test_dataset.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
If you do not pass `--model`, BatDetect2 uses the built-in default UK model.
|
||||||
|
If you want to choose a different checkpoint, alias, or Hugging Face model, see
|
||||||
|
{doc}`../how_to/choose-a-model`.
|
||||||
|
|
||||||
|
If you want to save the results somewhere else, add `--output-dir`:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
batdetect2 evaluate \
|
batdetect2 evaluate \
|
||||||
path/to/test_dataset.yaml \
|
path/to/test_dataset.yaml \
|
||||||
--model path/to/model.ckpt \
|
--model path/to/model.ckpt \
|
||||||
--base-dir path/to/project_root \
|
|
||||||
--output-dir path/to/eval_outputs
|
--output-dir path/to/eval_outputs
|
||||||
```
|
```
|
||||||
|
|
||||||
This command loads the checkpoint, runs prediction on the test dataset, applies
|
This command loads the model, runs prediction on the test dataset, applies the
|
||||||
the chosen evaluation tasks, and writes metrics and result files to the output
|
evaluation tasks, and writes the results to the output directory.
|
||||||
directory.
|
|
||||||
|
|
||||||
Use `--base-dir` whenever the dataset config contains relative paths.
|
## 3. Check the output files
|
||||||
|
|
||||||
That is the common case for project-local dataset files.
|
By default, the CLI writes evaluation outputs to `outputs/evaluation`.
|
||||||
|
|
||||||
## 3. Inspect the output directory
|
With the default evaluation config, a run will usually create a folder like
|
||||||
|
this:
|
||||||
|
|
||||||
Look for:
|
```text
|
||||||
|
outputs/evaluation/
|
||||||
|
version_0/
|
||||||
|
metrics.csv
|
||||||
|
hparams.yaml
|
||||||
|
```
|
||||||
|
|
||||||
- summary metrics,
|
The most important file is `metrics.csv`.
|
||||||
- generated plots,
|
It contains the metric values computed for the evaluation run.
|
||||||
- saved prediction files if they were enabled,
|
|
||||||
- enough metadata to reproduce the run later.
|
|
||||||
|
|
||||||
The exact set depends on the configured evaluation tasks and plots.
|
A file like this might start like:
|
||||||
|
|
||||||
## 4. Interpret the results in context
|
```csv
|
||||||
|
classification/average_precision/barbar,classification/average_precision/cneser,...,detection/average_precision
|
||||||
|
0.898695170879364,0.9408193826675415,...,0.851219117641449
|
||||||
|
```
|
||||||
|
|
||||||
Do not reduce evaluation to a single number.
|
The exact columns depend on the evaluation tasks you run.
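
To read a file like this without scanning a very wide row by eye, the columns can be grouped by their task prefix. The sketch below uses only the Python standard library and assumes the `task/metric/class` column naming seen in the excerpt above; the helper name `group_metrics` is hypothetical, and the sample text keeps only the three columns that the excerpt spells out.

```python
import csv
import io
from collections import defaultdict

# Sample text modelled on the excerpt above; the elided columns are left out.
SAMPLE = (
    "classification/average_precision/barbar,"
    "classification/average_precision/cneser,"
    "detection/average_precision\n"
    "0.898695170879364,0.9408193826675415,0.851219117641449\n"
)


def group_metrics(handle):
    """Group the last row of a metrics CSV by the task prefix of each column."""
    rows = list(csv.DictReader(handle))
    grouped = defaultdict(dict)
    if not rows:
        return grouped
    for column, value in rows[-1].items():
        if value in (None, ""):
            continue  # skip empty cells
        try:
            number = float(value)
        except ValueError:
            continue  # skip any non-numeric column
        # Assumption: the text before the first "/" names the task.
        task, _, rest = column.partition("/")
        grouped[task][rest] = number
    return grouped


metrics = group_metrics(io.StringIO(SAMPLE))
for task, values in sorted(metrics.items()):
    # Lowest values first, so the weakest classes are easy to spot.
    for name, value in sorted(values.items(), key=lambda kv: kv[1]):
        print(f"{task:15s} {name:30s} {value:.3f}")
```

For a real run, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("outputs/evaluation/version_0/metrics.csv", newline="")`.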
|
||||||
|
|
||||||
Check:
|
The `hparams.yaml` file records the config used for the evaluation run.
|
||||||
|
|
||||||
- which task the metric belongs to,
|
## 4. Expect extra plots and files when configs enable them
|
||||||
- which thresholding or matching assumptions were used,
|
|
||||||
- whether class-level behavior matches your use case,
|
|
||||||
- whether the failures are concentrated in specific taxa, sites, or recording
|
|
||||||
conditions.
|
|
||||||
|
|
||||||
## 5. Record the evaluation setup
|
You may also see extra outputs such as plots and saved predictions.
|
||||||
|
|
||||||
Keep the command, config files, checkpoint path, and dataset version together.
|
For example, if you run evaluation with `example_data/configs/evaluation.yaml`,
|
||||||
|
you should expect a richer output folder with:
|
||||||
|
|
||||||
That matters for reproducibility and for later model comparisons.
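
A small manifest file stored next to the evaluation outputs is one way to keep these pieces together. The sketch below is a suggestion, not a BatDetect2 feature: the function names, the manifest fields, and the `dataset_version` label are hypothetical. It hashes any config files that exist so that later edits to them can be detected.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(command, checkpoint, config_files, dataset_version):
    """Collect the details needed to repeat an evaluation run."""
    return {
        "created": datetime.now(timezone.utc).isoformat(),
        "command": list(command),
        "checkpoint": str(checkpoint),
        "dataset_version": dataset_version,
        # Hash only the config files that exist on disk.
        "config_sha256": {
            str(p): file_sha256(p) for p in config_files if Path(p).is_file()
        },
    }


manifest = build_manifest(
    command=[
        "batdetect2", "evaluate", "path/to/test_dataset.yaml",
        "--model", "path/to/model.ckpt",
        "--output-dir", "path/to/eval_outputs",
    ],
    checkpoint="path/to/model.ckpt",
    config_files=["path/to/test_dataset.yaml"],
    dataset_version="hypothetical-v1",  # a label you choose yourself
)
print(json.dumps(manifest, indent=2))
```

Writing the printed JSON into the output directory keeps the record beside the metrics it describes.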
|
- `metrics.csv`
|
||||||
|
- `hparams.yaml`
|
||||||
|
- a `plots/` directory
|
||||||
|
- a `predictions/` directory
|
||||||
|
|
||||||
## What to do next
|
That config enables more evaluation tasks and plots than the default setup.
|
||||||
|
|
||||||
- Compare thresholds on representative files:
|
So, depending on your evaluation config, you may see files such as:
|
||||||
{doc}`../how_to/tune-detection-threshold`
|
|
||||||
|
- precision-recall plots,
|
||||||
|
- ROC curves,
|
||||||
|
- confusion matrices,
|
||||||
|
- example detection plots,
|
||||||
|
- saved prediction files.
|
||||||
|
|
||||||
|
If you want to control which tasks run and which plots are generated, see
|
||||||
|
{doc}`../reference/evaluation-config` and
|
||||||
|
{doc}`../how_to/choose-and-configure-evaluation-tasks`.
|
||||||
|
|
||||||
|
## Common next steps
|
||||||
|
|
||||||
|
- Choose a different model:
|
||||||
|
{doc}`../how_to/choose-a-model`
|
||||||
- Configure evaluation tasks:
|
- Configure evaluation tasks:
|
||||||
{doc}`../how_to/choose-and-configure-evaluation-tasks`
|
{doc}`../how_to/choose-and-configure-evaluation-tasks`
|
||||||
- Interpret evaluation artifacts:
|
- Interpret evaluation artifacts:
|
||||||
|
|||||||