Using Legacy BatDetect2 Annotation Formats
Introduction
If you have annotation data created using older BatDetect2 annotation tools, BatDetect2 provides tools to load these datasets. These older formats typically use JSON files to store annotation information, including bounding boxes and labels for sound events within recordings.
There are two main variations of this legacy format that BatDetect2 can load:
- Directory-Based (format: "batdetect2"): Annotations for each audio recording are stored in a separate JSON file within a dedicated directory. A naming convention links each JSON file to its corresponding audio file (e.g., the annotations for my_recording.wav are stored in my_recording.wav.json).
- Single Merged File (format: "batdetect2_file"): Annotations for multiple recordings are aggregated into a single JSON file. This file contains a list, where each item represents the annotations for one recording, following the same internal structure as the directory-based format.
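The two layouts can be illustrated with a short Python sketch. This is not BatDetect2's own loader: the helper names (annotation_path_for, load_directory, load_merged) are hypothetical, and the sketch assumes only what is described above, namely the "audio filename plus .json" naming convention and a merged file that is a JSON list of the same per-recording objects.

```python
import json
from pathlib import Path


def annotation_path_for(audio_path, annotations_dir):
    """Map an audio file to its JSON file: my_recording.wav -> my_recording.wav.json."""
    return Path(annotations_dir) / (Path(audio_path).name + ".json")


def load_directory(annotations_dir):
    """Directory-based layout: each .json file holds one recording's annotations."""
    files = sorted(Path(annotations_dir).glob("*.json"))
    return [json.loads(f.read_text()) for f in files]


def load_merged(annotations_path):
    """Merged layout: one JSON file holding a list of per-recording objects."""
    records = json.loads(Path(annotations_path).read_text())
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of per-recording annotation objects")
    return records
```

Either helper returns a list of per-recording dictionaries, which is why the two formats can share the same downstream conversion and filtering steps.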
When you configure BatDetect2 to use these formats, it reads the legacy data and converts it internally into the standard soundevent data structures used by the rest of the pipeline.
Configuration
You specify which legacy format to use within the sources list of your main DatasetConfig (usually in your primary YAML configuration file).
Format 1: Directory-Based
Use this when you have a folder containing many individual JSON annotation files, one for each audio file.
Configuration Fields:
- format: "batdetect2" (Required): Identifies this specific legacy format loader.
- name: your_source_name (Required): A unique name for this data source.
- audio_dir: path/to/audio/files (Required): Path to the directory containing the .wav audio files.
- annotations_dir: path/to/annotation/jsons (Required): Path to the directory containing the individual .json annotation files.
- description: "Details..." (Optional): Description of this source.
- filter: ... (Optional): Settings to filter which JSON files are processed based on flags within them (see "Filtering Legacy Annotations" below).
YAML Example:
# In your main DatasetConfig YAML file
sources:
  - name: "OldProject_SiteA_Files"
    format: "batdetect2" # Use the directory-based loader
    audio_dir: "/data/SiteA/Audio/"
    annotations_dir: "/data/SiteA/Annotations_JSON/"
    description: "Legacy annotations stored as individual JSONs per recording."
    # filter: ... # Optional filter settings can be added here
Format 2: Single Merged File
Use this when you have a single JSON file that contains a list of annotations for multiple recordings.
Configuration Fields:
- format: "batdetect2_file" (Required): Identifies this specific legacy format loader.
- name: your_source_name (Required): A unique name for this data source.
- audio_dir: path/to/audio/files (Required): Path to the directory containing the .wav audio files referenced within the merged JSON file.
- annotations_path: path/to/your/merged_annotations.json (Required): Path to the single .json file containing the list of annotations.
- description: "Details..." (Optional): Description of this source.
- filter: ... (Optional): Settings to filter which records within the merged file are processed (see "Filtering Legacy Annotations" below).
YAML Example:
# In your main DatasetConfig YAML file
sources:
  - name: "OldProject_Merged"
    format: "batdetect2_file" # Use the merged file loader
    audio_dir: "/data/AllAudio/"
    annotations_path: "/data/CombinedAnnotations/old_project_merged.json"
    description: "Legacy annotations aggregated into a single JSON file."
    # filter: ... # Optional filter settings can be added here
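The only difference between the two source entries is the annotation location field: annotations_dir for the directory-based format and annotations_path for the merged file. The sketch below checks a parsed source entry for the required fields listed above. It is an illustration only, not BatDetect2's own validation; it assumes the entry has already been parsed from YAML into a plain dictionary, and the names REQUIRED_FIELDS and missing_fields are hypothetical.

```python
# Required keys per legacy format, taken from the field lists in this guide.
REQUIRED_FIELDS = {
    "batdetect2": {"format", "name", "audio_dir", "annotations_dir"},
    "batdetect2_file": {"format", "name", "audio_dir", "annotations_path"},
}


def missing_fields(source):
    """Return the sorted required keys that are absent from a source entry (a dict)."""
    fmt = source.get("format")
    if fmt not in REQUIRED_FIELDS:
        raise ValueError(f"unknown legacy format: {fmt!r}")
    return sorted(REQUIRED_FIELDS[fmt] - set(source))


# A merged-file source that mistakenly uses annotations_dir.
entry = {
    "name": "OldProject_Merged",
    "format": "batdetect2_file",
    "audio_dir": "/data/AllAudio/",
    "annotations_dir": "/data/CombinedAnnotations/",
}
print(missing_fields(entry))
```

A check like this catches the most likely mix-up when switching a source between the two formats.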
Filtering Legacy Annotations
The legacy JSON annotation structure (for both formats) included boolean flags indicating the status of the annotation work for each recording:
- annotated: Typically true if a human had reviewed or created annotations for the file.
- issues: Typically true if problems were noted during annotation or review.
You can optionally filter the data based on these flags using a filter: block within the source configuration. This applies whether you use "batdetect2" or "batdetect2_file".
Filter Options:
- only_annotated: <true_or_false>
  - true (Default): Only process entries where the annotated flag in the JSON is true.
  - false: Process entries regardless of the annotated flag.
- exclude_issues: <true_or_false>
  - true (Default): Skip entries where the issues flag in the JSON is true.
  - false: Process entries even if they are flagged with issues.
Default Filtering: If you omit the filter: block entirely, the default settings (only_annotated: true, exclude_issues: true) are applied automatically. This means only entries marked as annotated and not having issues will be loaded.
Disabling Filtering: To load all entries from the legacy source regardless of the annotated or issues flags, explicitly disable the filter:
    filter: null
YAML Example (Custom Filter): Only load entries marked as annotated, but include those with issues.
sources:
  - name: "LegacyData_WithIssues"
    format: "batdetect2" # Or "batdetect2_file"
    audio_dir: "path/to/audio"
    annotations_dir: "path/to/annotations" # Or annotations_path for merged
    filter:
      only_annotated: true
      exclude_issues: false # Include entries even if issues flag is true
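The filter semantics described in this section can be summarised in a few lines of Python. This is a minimal sketch of the documented behaviour, not BatDetect2's implementation. The function name keep_record is hypothetical, an omitted filter: block is modelled as an empty dictionary and filter: null as None, and a flag missing from the JSON is treated as false, which is an assumption.

```python
def keep_record(record, filter_config):
    """Decide whether one legacy per-recording entry passes the filter.

    filter_config is None (filtering disabled) or a dict that may contain
    only_annotated and exclude_issues; both default to True.
    """
    if filter_config is None:
        return True  # filter: null -> load everything
    only_annotated = filter_config.get("only_annotated", True)
    exclude_issues = filter_config.get("exclude_issues", True)
    # Assumption: a missing flag in the JSON counts as False.
    if only_annotated and not record.get("annotated", False):
        return False
    if exclude_issues and record.get("issues", False):
        return False
    return True


records = [
    {"annotated": True, "issues": False},
    {"annotated": True, "issues": True},
    {"annotated": False, "issues": False},
]
default_kept = [r for r in records if keep_record(r, {})]
custom_kept = [r for r in records if keep_record(r, {"exclude_issues": False})]
all_kept = [r for r in records if keep_record(r, None)]
```

With these three records, the default filter keeps only the first, the custom filter from the YAML example keeps the first two, and a disabled filter keeps all three.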
Summary
BatDetect2 allows you to incorporate datasets stored in older "BatDetect2" JSON formats.
- Use format: "batdetect2" and provide annotations_dir if you have one JSON file per recording in a directory.
- Use format: "batdetect2_file" and provide annotations_path if you have a single JSON file containing annotations for multiple recordings.
- Optionally use the filter: block with only_annotated and exclude_issues to select data based on flags present in the legacy JSON structure.
The system handles loading, filtering (with the defaults unless configured otherwise), and converting this legacy data into the standard soundevent format used internally.