# Tutorial: Evaluate on a test set

This tutorial shows how to evaluate a trained checkpoint on a held-out dataset
and inspect the output metrics.

This tutorial is for advanced users who want to compare one trained model against a separate test dataset.

## Before you start

- A trained model checkpoint.
- A test dataset config file.
- (Optional) Targets, audio, inference, and evaluation config overrides.

```{note}
This page is for model evaluation.
If you only want to run BatDetect2 on recordings,
start with {doc}`run-inference-on-folder` instead.
```

## Outcome

By the end of this tutorial you will have:

- run `batdetect2 evaluate`,
- written evaluation metrics and result files,
- understood what to inspect first,
- identified the next pages for evaluation concepts and configuration.

## 1. Start with a held-out dataset

Use a dataset that was not used for training or tuning.

A held-out dataset is simply a separate dataset kept aside for evaluation.

If you tune thresholds or configs on the same dataset that you report as final evaluation, the results will be optimistic.

## 2. Run evaluation

```bash
batdetect2 evaluate \
  path/to/model.ckpt \
  path/to/test_dataset.yaml \
  --base-dir path/to/project_root \
  --output-dir path/to/eval_outputs
```

This command loads the checkpoint,
runs prediction on the test dataset,
applies the chosen evaluation tasks,
and writes metrics and result files to the output directory.

Use `--base-dir` whenever the dataset config contains relative paths.

That is the common case for project-local dataset files.

## 3. Inspect the output directory

Look for:

- summary metrics,
- generated plots,
- saved prediction files if they were enabled,
- enough metadata to reproduce the run later.

The exact set depends on the configured evaluation tasks and plots.

## 4. Interpret the results in context

Do not reduce evaluation to a single number.

Check:

- which task the metric belongs to,
- which thresholding or matching assumptions were used,
- whether class-level behavior matches your use case,
- whether the failures are concentrated in specific taxa, sites, or recording conditions.

## 5. Record the evaluation setup

Keep the command, config files, checkpoint path, and dataset version together.

That matters for reproducibility and for later model comparisons.

## What to do next

- Compare thresholds on representative files:
  {doc}`../how_to/tune-detection-threshold`
- Configure evaluation tasks: {doc}`../how_to/choose-and-configure-evaluation-tasks`
- Interpret evaluation artifacts: {doc}`../how_to/interpret-evaluation-outputs`
- Learn the evaluation concepts: {doc}`../explanation/evaluation-concepts-and-matching`
- Check full evaluate options: {doc}`../reference/cli/evaluate`