Add documentation on how to use the legacy format

This commit is contained in:
mbsantiago 2025-04-19 12:25:20 +01:00
parent a0a77cada1
commit ae6063918c
3 changed files with 124 additions and 1 deletions

View File

@ -1,4 +1,4 @@
# Using AOEF / Soundevent Data Sources (`format: "aoef"`)
# Using AOEF / Soundevent Data Sources
## Introduction

View File

@ -5,4 +5,5 @@
:caption: Loading Data
aoef
legacy
```

122
docs/source/data/legacy.md Normal file
View File

@ -0,0 +1,122 @@
# Using Legacy BatDetect2 Annotation Formats
## Introduction
If you have annotation data created using older BatDetect2 annotation tools, BatDetect2 provides tools to load these datasets.
These older formats typically use JSON files to store annotation information, including bounding boxes and labels for sound events within recordings.
There are two main variations of this legacy format that BatDetect2 can load:
1. **Directory-Based (`format: "batdetect2"`):** Annotations for each audio recording are stored in a _separate_ JSON file within a dedicated directory.
There's a naming convention linking the JSON file to its corresponding audio file (e.g., `my_recording.wav` annotations are stored in `my_recording.wav.json`).
2. **Single Merged File (`format: "batdetect2_file"`):** Annotations for _multiple_ recordings are aggregated into a _single_ JSON file.
This file contains a list, where each item represents the annotations for one recording, following the same internal structure as the directory-based format.
When you configure BatDetect2 to use these formats, it will read the legacy data and convert it internally into the standard `soundevent` data structures used by the rest of the pipeline.
## Configuration
You specify which legacy format to use within the `sources` list of your main `DatasetConfig` (usually in your primary YAML configuration file).
### Format 1: Directory-Based
Use this when you have a folder containing many individual JSON annotation files, one for each audio file.
**Configuration Fields:**
- `format: "batdetect2"`: **(Required)** Identifies this specific legacy format loader.
- `name: your_source_name`: **(Required)** A unique name for this data source.
- `audio_dir: path/to/audio/files`: **(Required)** Path to the directory containing the `.wav` audio files.
- `annotations_dir: path/to/annotation/jsons`: **(Required)** Path to the directory containing the individual `.json` annotation files.
- `description: "Details..."`: (Optional) Description of this source.
- `filter: ...`: (Optional) Settings to filter which JSON files are processed based on flags within them (see "Filtering Legacy Annotations" below).
**YAML Example:**
```yaml
# In your main DatasetConfig YAML file
sources:
- name: "OldProject_SiteA_Files"
format: "batdetect2" # Use the directory-based loader
audio_dir: "/data/SiteA/Audio/"
annotations_dir: "/data/SiteA/Annotations_JSON/"
description: "Legacy annotations stored as individual JSONs per recording."
# filter: ... # Optional filter settings can be added here
```
### Format 2: Single Merged File
Use this when you have a single JSON file that contains a list of annotations for multiple recordings.
**Configuration Fields:**
- `format: "batdetect2_file"`: **(Required)** Identifies this specific legacy format loader.
- `name: your_source_name`: **(Required)** A unique name for this data source.
- `audio_dir: path/to/audio/files`: **(Required)** Path to the directory containing the `.wav` audio files referenced _within_ the merged JSON file.
- `annotations_path: path/to/your/merged_annotations.json`: **(Required)** Path to the single `.json` file containing the list of annotations.
- `description: "Details..."`: (Optional) Description of this source.
- `filter: ...`: (Optional) Settings to filter which records _within_ the merged file are processed (see "Filtering Legacy Annotations" below).
**YAML Example:**
```yaml
# In your main DatasetConfig YAML file
sources:
- name: "OldProject_Merged"
format: "batdetect2_file" # Use the merged file loader
audio_dir: "/data/AllAudio/"
annotations_path: "/data/CombinedAnnotations/old_project_merged.json"
description: "Legacy annotations aggregated into a single JSON file."
# filter: ... # Optional filter settings can be added here
```
## Filtering Legacy Annotations
The legacy JSON annotation structure (for both formats) included boolean flags indicating the status of the annotation work for each recording:
- `annotated`: Typically `true` if a human had reviewed or created annotations for the file.
- `issues`: Typically `true` if problems were noted during annotation or review.
You can optionally filter the data based on these flags using a `filter:` block within the source configuration.
This applies whether you use `"batdetect2"` or `"batdetect2_file"`.
**Filter Options:**
- `only_annotated: <true_or_false>`:
- `true` (**Default**): Only process entries where the `annotated` flag in the JSON is `true`.
- `false`: Process entries regardless of the `annotated` flag.
- `exclude_issues: <true_or_false>`:
- `true` (**Default**): Skip processing entries where the `issues` flag in the JSON is `true`.
- `false`: Process entries even if they are flagged with `issues`.
**Default Filtering:** If you **omit** the `filter:` block entirely, the default settings (`only_annotated: true`, `exclude_issues: true`) are applied automatically.
This means only entries marked as annotated and not having issues will be loaded.
**Disabling Filtering:** To load _all_ entries from the legacy source regardless of the `annotated` or `issues` flags, explicitly disable the filter:
```yaml
filter: null
```
**YAML Example (Custom Filter):** Only load entries marked as annotated, but _include_ those with issues.
```yaml
sources:
- name: "LegacyData_WithIssues"
format: "batdetect2" # Or "batdetect2_file"
audio_dir: "path/to/audio"
annotations_dir: "path/to/annotations" # Or annotations_path for merged
filter:
only_annotated: true
exclude_issues: false # Include entries even if issues flag is true
```
## Summary
BatDetect2 allows you to incorporate datasets stored in older "BatDetect2" JSON formats.
- Use `format: "batdetect2"` and provide `annotations_dir` if you have one JSON file per recording in a directory.
- Use `format: "batdetect2_file"` and provide `annotations_path` if you have a single JSON file containing annotations for multiple recordings.
- Optionally use the `filter:` block with `only_annotated` and `exclude_issues` to select data based on flags present in the legacy JSON structure.
The system will handle loading, filtering (if configured), and converting this legacy data into the standard `soundevent` format used internally.