batdetect2/docs/source/how_to/interpret-evaluation-outputs.md

1.3 KiB

How to interpret evaluation outputs

Use this guide after batdetect2 evaluate has written metrics and plots to disk.

Start by identifying the task

Do not interpret a metric until you know which evaluation task produced it.

For example, a detection score and a clip-classification score answer different questions.

Read the output directory as a bundle

Treat the evaluation output directory as one package:

  • metrics,
  • plots,
  • saved predictions,
  • config context.

Do not lift a single number out of context and treat it as the whole story.

Look for failure patterns, not just overall averages

Check:

  • whether errors concentrate in certain taxa,
  • whether specific sites or recorder setups behave differently,
  • whether threshold choices are driving the result,
  • whether predictions are near clip boundaries or matching thresholds.

Keep validation and deployment questions separate

A model can look good on one task and still be a poor fit for your deployment question.

Interpret the outputs in relation to the real use case, not only the easiest metric to report.

  • Evaluation tutorial: {doc}../tutorials/evaluate-on-a-test-set
  • Evaluation concepts: {doc}../explanation/evaluation-concepts-and-matching
  • Model output and validation: {doc}../explanation/model-output-and-validation