mirror of https://github.com/macaodha/batdetect2.git

Updating training preprocess notebook

parent f63307757c
commit d84b7795f6

datasets.md · 110 lines · Normal file

@@ -0,0 +1,110 @@
# Batdetect2 Datasets

This document describes the datasets used to train and evaluate the `batdetect2` model for acoustic bat detection and classification.

`batdetect2` was trained using a combination of datasets, primarily collected and annotated in partnership with the Bat Conservation Trust (BCT).
The data sources include:

- **Dedicated recordings**: The majority of the data comprises focal recordings of individual bats with known species identifications. These were collected specifically for this project in collaboration with the BCT and partners.
- **BatDetective project**: A subset of recordings was selected from the BatDetective project and re-annotated for this project.
- **Queen Elizabeth Olympic Park**: Recordings were sourced from the acoustic monitoring network in East London.
## Annotation Approach

Most datasets contain annotations with species-level identification.
However, the `BatDetective` and `bat_logger*` datasets lack species identification for individual calls, because these recordings were collected passively without on-site species verification.
Annotations for these datasets include bounding boxes around echolocation calls but no species labels.
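For illustration only, an annotation from one of these species-free datasets might look roughly like the sketch below. The field names are hypothetical placeholders rather than the actual `batdetect2` annotation schema; the point is simply that each call gets a time-frequency bounding box and an event label, but no species.

```python
# Hypothetical single-call annotation from a dataset without species IDs.
# Field names are illustrative placeholders, not the real annotation schema.
annotation_without_species = {
    "start_time": 1.234,      # seconds from the start of the recording
    "end_time": 1.241,
    "low_freq": 32000,        # Hz, lower edge of the bounding box
    "high_freq": 78000,       # Hz, upper edge of the bounding box
    "event": "Echolocation",  # the call type is annotated...
    "class": "",              # ...but the species label is left empty
}
```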
## Dataset Summary

The following table provides an overview of each dataset used in training and evaluating `batdetect2`.
It includes the location of the dataset files and whether the annotations include species identification ("Species ID").

| Dataset               | Dataset Path             | Species ID |
| --------------------- | ------------------------ | ---------- |
| BatDetective          | bat_detective_batdetect2 | no         |
| bat_logger_qeop_empty | bat_logger_qeop_empty    | no         |
| bat_logger_2016_empty | bat_logger_2016          | no         |
| echobank              | echobank_batdetect2      | yes        |
| sn_scot_nor           | sn_scot_nor              | yes        |
| bct_1_sec             | bct_1_sec                | yes        |
| bcireland             | bcireland                | yes        |
| rhinolophus_bct       | rhinolophus_bct          | yes        |
| bat_data_2018         | bat_data_2018            | yes        |
| bat_data_2018_test    | bat_data_2018_test       | yes        |
| bat_data_2019         | bat_data_2019            | yes        |
| bat_data_2019_test    | bat_data_2019_test       | yes        |
## Train/Test Splits

To ensure a robust and unbiased evaluation of `batdetect2`, we carefully considered how to split the data into training and test sets.
Due to potential dependencies within datasets (e.g., recordings from the same site, date, or even individual), we implemented two distinct splitting strategies:

1. **Split Diff**: This strategy assigns entire datasets to either the training or test set.
   This maximizes the independence between training and testing data, reducing the chance of the model encountering similar recordings in both sets.

2. **Split Same**: In this approach, each dataset is individually split into training and test subsets.
   This means that recordings in the test set might share similarities with those in the training set (e.g., same site, methodology).
   This split helps assess the model's ability to generalize to new recordings within familiar contexts (a sketch of one way to produce such a split follows this list).
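As a rough illustration of the Split Same idea, the sketch below assigns the recordings of a single dataset to train or test by hashing their file names, which keeps the assignment deterministic across runs. This is only one way such a within-dataset split could be produced; it is not the procedure used to build the annotation sets listed below, and the `<dataset>/audio` layout it assumes is the one described in the File structure section.

```python
import hashlib
from pathlib import Path

def split_same(dataset_dir: str, test_fraction: float = 0.2):
    """Deterministically split one dataset's recordings into train/test lists.

    Illustrative only: hashes each file name and thresholds the hash,
    so roughly `test_fraction` of recordings land in the test set.
    """
    train, test = [], []
    for wav in sorted(Path(dataset_dir, "audio").glob("*.wav")):
        digest = hashlib.md5(wav.name.encode("utf-8")).hexdigest()
        fraction = int(digest[:8], 16) / 16**8  # map the hash to [0, 1)
        (test if fraction < test_fraction else train).append(wav.name)
    return train, test

# e.g. train_files, test_files = split_same("data/echobank_batdetect2")
```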
For each split, the corresponding annotations were organized into JSON files.
The following tables list the annotation sets used for training and testing, along with their associated dataset:
### Split Diff Summary

| Dataset               | Annotation Set Name                           | Is Test |
| --------------------- | --------------------------------------------- | ------- |
| BatDetective          | train_set_bulgaria_batdetective_with_bbs.json | no      |
| bat_logger_qeop_empty | bat_logger_qeop_empty.json                    | no      |
| bat_logger_2016       | train_set_bat_logger_2016_empty.json          | no      |
| echobank_batdetect2   | Echobank_train_expert.json                    | no      |
| sn_scot_nor           | sn_scot_nor_0.5_expert.json                   | no      |
| bct_1_sec             | bct_1_sec_train_expert.json                   | no      |
| bcireland             | bcireland_expert.json                         | no      |
| rhinolophus_bct       | rhinolophus_BCT_expert.json                   | no      |
| bat_data_2018         | BritishBatCalls_2018_1_sec_train_expert.json  | yes     |
| bat_data_2018_test    | BritishBatCalls_2018_1_sec_test_expert.json   | yes     |
| bat_data_2019         | BritishBatCalls_2019_1_sec_test_expert.json   | yes     |
| bat_data_2019_test    | BritishBatCalls_2019_1_sec_test_expert.json   | yes     |
### Split Same Summary

| Dataset               | Annotation Set Name                                | Is Test |
| --------------------- | -------------------------------------------------- | ------- |
| BatDetective          | train_set_bulgaria_batdetective_with_bbs.json      | no      |
| bat_logger_qeop_empty | bat_logger_qeop_empty.json                         | no      |
| bat_logger_2016       | train_set_bat_logger_2016_empty.json               | no      |
| echobank              | Echobank_train_expert_TRAIN.json                   | no      |
| sn_scot_nor           | sn_scot_nor_0.5_expert_TRAIN.json                  | no      |
| bct_1_sec             | BCT_1_sec_train_expert_TRAIN.json                  | no      |
| bcireland             | bcireland_expert_TRAIN.json                        | no      |
| rhinolophus_bct       | rhinolophus_BCT_expert_TRAIN.json                  | no      |
| bat_data_2018         | BritishBatCalls_2018_1_sec_train_expert_TRAIN.json | no      |
| bat_data_2018_test    | BritishBatCalls_2018_1_sec_test_expert_TRAIN.json  | no      |
| bat_data_2019         | BritishBatCalls_2019_1_sec_train_expert_TRAIN.json | no      |
| bat_data_2019_test    | BritishBatCalls_2019_1_sec_test_expert_TRAIN.json  | no      |
| echobank              | Echobank_train_expert_TEST.json                    | yes     |
| sn_scot_nor           | sn_scot_nor_0.5_expert_TEST.json                   | yes     |
| BCT_1_sec             | BCT_1_sec_train_expert_TEST.json                   | yes     |
| bcireland             | bcireland_expert_TEST.json                         | yes     |
| rhinolophus_bct       | rhinolophus_BCT_expert_TEST.json                   | yes     |
| bat_data_2018         | BritishBatCalls_2018_1_sec_train_expert_TEST.json  | yes     |
| bat_data_2018_test    | BritishBatCalls_2018_1_sec_test_expert_TEST.json   | yes     |
| bat_data_2019         | BritishBatCalls_2019_1_sec_train_expert_TEST.json  | yes     |
| bat_data_2019_test    | BritishBatCalls_2019_1_sec_test_expert_TEST.json   | yes     |
## File structure

Each dataset is organized within a separate folder, as listed in the Dataset Summary table.
Within each dataset folder, you'll find the following subfolders (a short loading sketch follows the list):

- `audio`: Contains all the raw WAV audio recordings in a flat structure.
- `annotation_sets`: Contains JSON files that gather all annotations for the different train-test splits.
  For example, the `rhinolophus_BCT_expert_TRAIN.json` annotation set for the `rhinolophus_bct` dataset would be located at `rhinolophus_bct/annotation_sets/rhinolophus_BCT_expert_TRAIN.json`.
- `annotations`: (Optional) This folder contains individual JSON files for each recording in the dataset, storing all annotations for the corresponding recording.
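As a minimal sketch of how these pieces fit together, the snippet below resolves and loads one annotation-set JSON using the layout above. The `data` root directory is an assumed placeholder, and nothing is assumed here about the internal structure of the JSON itself.

```python
import json
from pathlib import Path

# Assumed placeholder for wherever the dataset folders listed above live.
DATA_ROOT = Path("data")

def load_annotation_set(dataset: str, annotation_set: str):
    """Load one annotation-set JSON following the documented folder layout."""
    path = DATA_ROOT / dataset / "annotation_sets" / annotation_set
    with open(path) as fp:
        return json.load(fp)

# Example from the text: the Split Same training annotations for rhinolophus_bct,
# i.e. data/rhinolophus_bct/annotation_sets/rhinolophus_BCT_expert_TRAIN.json
annotations = load_annotation_set("rhinolophus_bct", "rhinolophus_BCT_expert_TRAIN.json")
```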
@@ -179,10 +179,21 @@
 "    ]\n",
 "\n",
 "    # Classify a given sound event annotation\n",
-"    def encode(self, x: data.SoundEventAnnotation) -> Optional[str]:\n",
+"    def encode(self, sound_event_annotation: data.SoundEventAnnotation) -> Optional[str]:\n",
 "\n",
 "        # Extract the \"event\" tag (e.g., \"Echolocation\" or \"Social\")\n",
-"        event_tag = data.find_tag(x.tags, \"event\")\n",
+"        event_tag = data.find_tag(sound_event_annotation.tags, \"event\")\n",
 "\n",
+"        # Extract the \"class\" tag (species) for echolocation calls\n",
+"        species_tag = data.find_tag(sound_event_annotation.tags, \"class\")\n",
+"\n",
+"        if event_tag is None and species_tag is None:\n",
+"            return None\n",
+"\n",
+"        assert species_tag is not None\n",
+"\n",
+"        if event_tag is None:\n",
+"            return species_tag.value\n",
+"\n",
 "        # If it's a social call, return \"social\" as the class\n",
 "        if event_tag.value == \"Social\":\n",
@@ -192,16 +203,14 @@
 "        if event_tag.value != \"Echolocation\":\n",
 "            return None\n",
 "\n",
-"        # Extract the \"class\" tag (species) for echolocation calls\n",
-"        species_tag = data.find_tag(x.tags, \"class\")\n",
 "        return species_tag.value\n",
 "\n",
 "    # Convert a class prediction back into annotation tags\n",
-"    def decode(self, class_name: str) -> List[data.Tag]:\n",
-"        if class_name == \"social\":\n",
+"    def decode(self, label: str) -> List[data.Tag]:\n",
+"        if label == \"social\":\n",
 "            return [data.Tag(key=\"event\", value=\"social\")]\n",
 "\n",
-"        return [data.Tag(key=\"class\", value=class_name)]"
+"        return [data.Tag(key=\"class\", value=label)]"
 ]
 },
 {
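To make this notebook cell easier to follow outside the diff, here is a small self-contained sketch of the same encode/decode idea. The `Tag`, `SoundEventAnnotation`, and `find_tag` definitions below are simplified stand-ins assumed for illustration, not the project's real `data` module, and the control flow paraphrases the cell above while handling missing tags defensively instead of asserting.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified stand-ins for the notebook's `data` module (assumptions, not the real API).
@dataclass
class Tag:
    key: str
    value: str

@dataclass
class SoundEventAnnotation:
    tags: List[Tag] = field(default_factory=list)

def find_tag(tags: List[Tag], key: str) -> Optional[Tag]:
    """Return the first tag with the given key, or None."""
    return next((tag for tag in tags if tag.key == key), None)

def encode(annotation: SoundEventAnnotation) -> Optional[str]:
    """Map an annotation to a training class name: 'social', a species, or None."""
    event_tag = find_tag(annotation.tags, "event")
    species_tag = find_tag(annotation.tags, "class")
    if event_tag is not None and event_tag.value == "Social":
        return "social"
    if event_tag is not None and event_tag.value != "Echolocation":
        return None
    return species_tag.value if species_tag is not None else None

def decode(label: str) -> List[Tag]:
    """Map a predicted class name back to annotation tags."""
    if label == "social":
        return [Tag(key="event", value="social")]
    return [Tag(key="class", value=label)]

# Round trip for an echolocation call with a species label.
call = SoundEventAnnotation(tags=[Tag("event", "Echolocation"), Tag("class", "Myotis daubentonii")])
assert encode(call) == "Myotis daubentonii"
assert decode("Myotis daubentonii") == [Tag(key="class", value="Myotis daubentonii")]
```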
@@ -1,5 +1,5 @@
 [tool]
-rye = { dev-dependencies = [
+uv = { dev-dependencies = [
 "ipykernel>=6.29.4",
 "setuptools>=69.5.1",
 "pytest>=8.1.1",