diff --git a/docs/source/tutorials/evaluate-on-a-test-set.md b/docs/source/tutorials/evaluate-on-a-test-set.md
index 6f8b2ed..d1d512e 100644
--- a/docs/source/tutorials/evaluate-on-a-test-set.md
+++ b/docs/source/tutorials/evaluate-on-a-test-set.md
@@ -1,92 +1,133 @@
-# Tutorial: Evaluate on a test set
+# Evaluate on a test set
 
 This tutorial shows how to evaluate a trained checkpoint on a held-out dataset
 and inspect the output metrics.
 
-This tutorial is for advanced users who want to compare one trained model
-against a separate test dataset.
+Use it when you want to measure how a model performs on labelled data that was
+kept aside for testing.
 
 ## Before you start
 
-- A trained model checkpoint.
-- A test dataset config file.
-- (Optional) Targets, audio, inference, and evaluation config overrides.
+You need:
+
+- a test dataset config,
+- a trained checkpoint or model alias.
 
 ```{note}
 This page is for model evaluation.
-If you only want to run BatDetect2 on recordings,
-start with {doc}`run-inference-on-folder` instead.
+If you only want to run BatDetect2 on recordings, start with
+{doc}`run-inference-on-folder` instead.
 ```
 
-## Outcome
+## What you will do
 
 By the end of this tutorial you will have:
 
+- prepared a test dataset config,
 - run `batdetect2 evaluate`,
 - written evaluation metrics and result files,
-- understood what to inspect first,
-- identified the next pages for evaluation concepts and configuration.
+- identified the next pages for model choice and evaluation configuration.
 
-## 1. Start with a held-out dataset
+## 1. Create a test dataset config
+
+Evaluation needs a dataset config that points to the labelled data you want to
+use for testing.
+
+This is the same kind of dataset config used for training.
+It explicitly declares which data sources BatDetect2 should read, including the
+audio files and their annotations.
+
+For an example, see `example_data/dataset.yaml`.
+
+If you need help creating the dataset config, follow the dataset section in
+{doc}`train-a-custom-model`.
+For more detail on dataset source formats, see {doc}`../reference/data-sources`.
 
 Use a dataset that was not used for training or tuning.
 
-A held-out dataset is simply a separate dataset kept aside for evaluation.
-
-If you tune thresholds or configs on the same dataset that you report as final
-evaluation, the results will be optimistic.
-
 ## 2. Run evaluation
 
+For a simple run, use:
+
+```bash
+batdetect2 evaluate \
+  path/to/test_dataset.yaml
+```
+
+If you do not pass `--model`, BatDetect2 uses the built-in default UK model.
+If you want to choose a different checkpoint, alias, or Hugging Face model, see
+{doc}`../how_to/choose-a-model`.
+
+If you want to save the results somewhere else, add `--output-dir`:
+
 ```bash
 batdetect2 evaluate \
   path/to/test_dataset.yaml \
   --model path/to/model.ckpt \
-  --base-dir path/to/project_root \
   --output-dir path/to/eval_outputs
 ```
 
-This command loads the checkpoint, runs prediction on the test dataset, applies
-the chosen evaluation tasks, and writes metrics and result files to the output
-directory.
+This command loads the model, runs prediction on the test dataset, applies the
+evaluation tasks, and writes the results to the output directory.
 
-Use `--base-dir` whenever the dataset config contains relative paths.
+## 3. Check the output files
 
-That is the common case for project-local dataset files.
+By default, the CLI writes evaluation outputs to `outputs/evaluation`.
 
-## 3. Inspect the output directory
+With the default evaluation config, a run will usually create a folder like
+this:
 
-Look for:
+```text
+outputs/evaluation/
+  version_0/
+    metrics.csv
+    hparams.yaml
+```
 
-- summary metrics,
-- generated plots,
-- saved prediction files if they were enabled,
-- enough metadata to reproduce the run later.
+The most important file is `metrics.csv`.
+It contains the metric values computed for the evaluation run.
 
-The exact set depends on the configured evaluation tasks and plots.
+A file like this might start like:
 
-## 4. Interpret the results in context
+```csv
+classification/average_precision/barbar,classification/average_precision/cneser,...,detection/average_precision
+0.898695170879364,0.9408193826675415,...,0.851219117641449
+```
 
-Do not reduce evaluation to a single number.
+The exact columns depend on the evaluation tasks you run.
 
-Check:
+The `hparams.yaml` file records the config used for the evaluation run.
 
-- which task the metric belongs to,
-- which thresholding or matching assumptions were used,
-- whether class-level behavior matches your use case,
-- whether the failures are concentrated in specific taxa, sites, or recording
-  conditions.
+## 4. Expect extra plots and files when configs enable them
 
-## 5. Record the evaluation setup
+You may also see extra outputs such as plots and saved predictions.
 
-Keep the command, config files, checkpoint path, and dataset version together.
+For example, if you run evaluation with `example_data/configs/evaluation.yaml`,
+you should expect a richer output folder with:
 
-That matters for reproducibility and for later model comparisons.
+- `metrics.csv`
+- `hparams.yaml`
+- a `plots/` directory
+- a `predictions/` directory
 
-## What to do next
+That config enables more evaluation tasks and plots than the default setup.
 
-- Compare thresholds on representative files:
-  {doc}`../how_to/tune-detection-threshold`
+So, depending on your evaluation config, you may see files such as:
+
+- precision-recall plots,
+- ROC curves,
+- confusion matrices,
+- example detection plots,
+- saved prediction files.
+
+If you want to control which tasks run and which plots are generated, see
+{doc}`../reference/evaluation-config` and
+{doc}`../how_to/choose-and-configure-evaluation-tasks`.
+
+## Common next steps
+
+- Choose a different model:
+  {doc}`../how_to/choose-a-model`
 - Configure evaluation tasks:
   {doc}`../how_to/choose-and-configure-evaluation-tasks`
 - Interpret evaluation artifacts: