Evaluation concepts and matching

Evaluation is not just "run predictions and compute one number".

The reported metric depends on the evaluation task, the matching rule, and the treatment of clip boundaries and generic labels.

Task families answer different questions

Built-in task families include:

Choose the task that matches the scientific or engineering question.

For sound-event-style tasks, predictions and annotations are matched using an affinity function.

Important controls include:

Small changes to these matching settings can change the reported metric even when the underlying predictions are identical.
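To make the sensitivity to matching settings concrete, here is a minimal sketch of affinity-based matching. It is not the library's implementation. It assumes events are plain time intervals, uses temporal intersection-over-union as the affinity function, and pairs events greedily from highest to lowest affinity. The names `Event`, `temporal_iou`, `greedy_match`, `precision_recall`, and the `threshold` parameter are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical sound event described only by its time span (ms)."""
    start: float
    end: float


def temporal_iou(a: Event, b: Event) -> float:
    """Affinity assumed here: intersection-over-union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(predictions, annotations, affinity=temporal_iou, threshold=0.5):
    """Pair predictions with annotations, highest affinity first.

    Each prediction and each annotation is used at most once, and pairs
    whose affinity falls below `threshold` are never matched.
    """
    scored = sorted(
        (
            (affinity(p, a), i, j)
            for i, p in enumerate(predictions)
            for j, a in enumerate(annotations)
        ),
        key=lambda item: item[0],
        reverse=True,
    )
    used_predictions, used_annotations, matches = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining candidates have even lower affinity
        if i in used_predictions or j in used_annotations:
            continue
        used_predictions.add(i)
        used_annotations.add(j)
        matches.append((i, j, score))
    return matches


def precision_recall(predictions, annotations, threshold):
    """Count matched pairs as true positives and derive precision and recall."""
    true_positives = len(greedy_match(predictions, annotations, threshold=threshold))
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(annotations) if annotations else 0.0
    return precision, recall


annotations = [Event(100.0, 200.0), Event(500.0, 600.0)]
predictions = [Event(100.0, 200.0), Event(520.0, 640.0)]

# Same predictions, two different matching thresholds.
print(precision_recall(predictions, annotations, threshold=0.5))
print(precision_recall(predictions, annotations, threshold=0.7))
```

The second prediction overlaps its annotation with an IoU of 80/140 (about 0.57), so it counts as a match at a threshold of 0.5 but as a false positive at 0.7. The predictions are unchanged, yet precision and recall drop from 1.0 to 0.5.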

The evaluation base task can exclude events near clip boundaries through ignore_start_end.

This is useful when clip boundaries make matches ambiguous, for example when an event is cut off by the start or end of a clip and only part of it is visible.
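The effect of a boundary buffer can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it treats `ignore_start_end` as a duration in seconds and drops any event that is not fully inside the clip once that buffer is removed from both ends. The function name `drop_boundary_events` and the tuple representation of events are hypothetical, and the actual rule used by the evaluation task may differ.

```python
def drop_boundary_events(events, clip_start, clip_end, ignore_start_end):
    """Keep only events lying fully inside the clip minus a buffer at each end.

    `events` is a list of (start, end) tuples in seconds. The interpretation
    of `ignore_start_end` as a symmetric buffer is an assumption made for
    this sketch.
    """
    earliest = clip_start + ignore_start_end
    latest = clip_end - ignore_start_end
    return [
        (start, end)
        for start, end in events
        if start >= earliest and end <= latest
    ]


events = [(0.0, 0.03), (0.4, 0.45), (0.995, 1.0)]
kept = drop_boundary_events(events, clip_start=0.0, clip_end=1.0, ignore_start_end=0.01)
print(kept)  # [(0.4, 0.45)]
```

In a sketch like this the same filter would be applied to both predictions and annotations before matching, so that a truncated event is neither rewarded nor penalized.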

Classification tasks can include or exclude generic targets depending on configuration.

This choice determines which targets enter the class-level comparison, and therefore changes the class-level metrics that are reported.
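A small sketch can show why this configuration matters. It assumes a hypothetical placeholder label `"generic"` for targets without a specific class, and it assumes that when generic targets are included, a specific predicted class can never equal them. Neither the label nor the `class_accuracy` function comes from the library; the real handling of generic targets may be different.

```python
GENERIC = "generic"  # hypothetical placeholder for a non-specific target label


def class_accuracy(pairs, include_generic):
    """Fraction of (target, predicted) pairs whose labels agree.

    When `include_generic` is False, pairs whose target is the generic
    label are removed before the comparison.
    """
    if not include_generic:
        pairs = [(target, predicted) for target, predicted in pairs if target != GENERIC]
    if not pairs:
        return 0.0
    correct = sum(1 for target, predicted in pairs if target == predicted)
    return correct / len(pairs)


pairs = [
    ("species_a", "species_a"),
    ("species_b", "species_a"),
    (GENERIC, "species_a"),
    (GENERIC, "species_b"),
]

print(class_accuracy(pairs, include_generic=True))   # 0.25
print(class_accuracy(pairs, include_generic=False))  # 0.5
```

The predictions are the same in both calls. Only the set of targets admitted to the comparison changes, and the score changes with it.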