Batdetect2 Datasets

This document describes the datasets used to train and evaluate the batdetect2 model for acoustic bat detection and classification.

batdetect2 was trained using a combination of datasets, primarily collected and annotated in partnership with the Bat Conservation Trust (BCT). The data sources include:

Dedicated recordings: The majority of the data comprises focal recordings of individual bats with known species identifications. These were collected specifically for this project in collaboration with the BCT and partners.
BatDetective project: A subset of recordings was selected from the BatDetective project and re-annotated for this project.
Queen Elizabeth Olympic Park: Recordings were sourced from the acoustic monitoring network in East London.

Annotation Approach

Most datasets contain annotations with species-level identification. However, the BatDetective and bat_logger* datasets lack species identification for individual calls. This is because these recordings were collected passively without on-site species verification. Annotations for these datasets include bounding boxes around echolocation calls but no species labels.

Dataset Summary

The following table provides an overview of each dataset used in training and evaluating batdetect2. It includes the location of the dataset files and whether the annotations include species identification ("Species ID").

Dataset	Dataset Path	Species ID
BatDetective	bat_detective_batdetect2	no
bat_logger_qeop_empty	bat_logger_qeop_empty	no
bat_logger_2016_empty	bat_logger_2016	no
echobank	echobank_batdetect2	yes
sn_scot_nor	sn_scot_nor	yes
bct_1_sec	bct_1_sec	yes
bcireland	bcireland	yes
rhinolophus_bct	rhinolophus_bct	yes
bat_data_2018	bat_data_2018	yes
bat_data_2018_test	bat_data_2018_test	yes
bat_data_2019	bat_data_2019	yes
bat_data_2019_test	bat_data_2019_test	yes

Train-Test Splits

To ensure a robust and unbiased evaluation of batdetect2, we carefully considered how to split the data into training and test sets. Due to potential dependencies within datasets (e.g., recordings from the same site, date, or even individual), we implemented two distinct splitting strategies:

Split Diff: This strategy assigns entire datasets to either the training or test set. This maximizes the independence between training and testing data, reducing the chance of the model encountering similar recordings in both sets.
Split Same: In this approach, each dataset is individually split into training and test subsets. This means that recordings in the test set might share similarities with those in the training set (e.g., same site, methodology). This split helps assess the model's ability to generalize to new recordings within familiar contexts.
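The difference between the two strategies can be illustrated with a minimal sketch. This is not the procedure used to build the actual batdetect2 splits: the function names, the 20% test fraction, the random seed, and the toy file names are all assumptions made for illustration.

```python
import random


def split_diff(recordings_by_dataset, test_datasets):
    """Split Diff: every recording of a dataset goes to the same side."""
    train, test = [], []
    for dataset, recordings in recordings_by_dataset.items():
        target = test if dataset in test_datasets else train
        target.extend((dataset, rec) for rec in recordings)
    return train, test


def split_same(recordings_by_dataset, test_fraction=0.2, seed=0):
    """Split Same: each dataset is split internally into train and test.

    test_fraction and seed are hypothetical values, not the ones used
    for the published annotation sets.
    """
    rng = random.Random(seed)
    train, test = [], []
    for dataset, recordings in recordings_by_dataset.items():
        shuffled = sorted(recordings)
        rng.shuffle(shuffled)
        n_test = int(round(len(shuffled) * test_fraction))
        test.extend((dataset, rec) for rec in shuffled[:n_test])
        train.extend((dataset, rec) for rec in shuffled[n_test:])
    return train, test


# Toy data: two datasets with five made-up recordings each.
toy = {
    "dataset_a": [f"a_{i}.wav" for i in range(5)],
    "dataset_b": [f"b_{i}.wav" for i in range(5)],
}

diff_train, diff_test = split_diff(toy, test_datasets={"dataset_b"})
same_train, same_test = split_same(toy)
```

Under Split Diff, no dataset appears on both sides. Under Split Same, every dataset contributes recordings to both the training and the test set.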

For each split, the corresponding annotations were organized into JSON files. The following tables list the annotation sets used for training and testing, along with their associated dataset:

Split Diff Summary

Dataset	Annotation Set Name	Is Test
BatDetective	train_set_bulgaria_batdetective_with_bbs.json	no
bat_logger_qeop_empty	bat_logger_qeop_empty.json	no
bat_logger_2016	train_set_bat_logger_2016_empty.json	no
echobank	Echobank_train_expert.json	no
sn_scot_nor	sn_scot_nor_0.5_expert.json	no
bct_1_sec	bct_1_sec_train_expert.json	no
bcireland	bcireland_expert.json	no
rhinolophus_bct	rhinolophus_BCT_expert.json	no
bat_data_2018	BritishBatCalls_2018_1_sec_train_expert.json	yes
bat_data_2018_test	BritishBatCalls_2018_1_sec_test_expert.json	yes
bat_data_2019	BritishBatCalls_2019_1_sec_train_expert.json	yes
bat_data_2019_test	BritishBatCalls_2019_1_sec_test_expert.json	yes

Split Same Summary

Dataset	Annotation Set Name	Is Test
BatDetective	train_set_bulgaria_batdetective_with_bbs.json	no
bat_logger_qeop_empty	bat_logger_qeop_empty.json	no
bat_logger_2016	train_set_bat_logger_2016_empty.json	no
echobank	Echobank_train_expert_TRAIN.json	no
sn_scot_nor	sn_scot_nor_0.5_expert_TRAIN.json	no
bct_1_sec	BCT_1_sec_train_expert_TRAIN.json	no
bcireland	bcireland_expert_TRAIN.json	no
rhinolophus_bct	rhinolophus_BCT_expert_TRAIN.json	no
bat_data_2018	BritishBatCalls_2018_1_sec_train_expert_TRAIN.json	no
bat_data_2018_test	BritishBatCalls_2018_1_sec_test_expert_TRAIN.json	no
bat_data_2019	BritishBatCalls_2019_1_sec_train_expert_TRAIN.json	no
bat_data_2019_test	BritishBatCalls_2019_1_sec_test_expert_TRAIN.json	no
echobank	Echobank_train_expert_TEST.json	yes
sn_scot_nor	sn_scot_nor_0.5_expert_TEST.json	yes
bct_1_sec	BCT_1_sec_train_expert_TEST.json	yes
bcireland	bcireland_expert_TEST.json	yes
rhinolophus_bct	rhinolophus_BCT_expert_TEST.json	yes
bat_data_2018	BritishBatCalls_2018_1_sec_train_expert_TEST.json	yes
bat_data_2018_test	BritishBatCalls_2018_1_sec_test_expert_TEST.json	yes
bat_data_2019	BritishBatCalls_2019_1_sec_train_expert_TEST.json	yes
bat_data_2019_test	BritishBatCalls_2019_1_sec_test_expert_TEST.json	yes
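In the Split Same table above, every test annotation set ends in `_TEST.json`, and every other file belongs to the training side. The sketch below relies only on that naming pattern as it appears in the table. The helper names are hypothetical and are not part of batdetect2.

```python
def is_test_set(annotation_set_name):
    """Return True if a Split Same annotation set is a test set.

    Relies on the "_TEST.json" suffix seen in the Split Same table.
    Names without a suffix (e.g. bat_logger_qeop_empty.json) count as training.
    """
    return annotation_set_name.endswith("_TEST.json")


def partition_split_same(annotation_set_names):
    """Separate annotation set names into (train, test) lists."""
    train = [n for n in annotation_set_names if not is_test_set(n)]
    test = [n for n in annotation_set_names if is_test_set(n)]
    return train, test


# A few names copied from the Split Same table.
names = [
    "bat_logger_qeop_empty.json",
    "Echobank_train_expert_TRAIN.json",
    "Echobank_train_expert_TEST.json",
    "BritishBatCalls_2018_1_sec_test_expert_TRAIN.json",
    "BritishBatCalls_2018_1_sec_test_expert_TEST.json",
]
train_names, test_names = partition_split_same(names)
```

Checking the suffix rather than searching for "test" anywhere in the name matters here, because files such as BritishBatCalls_2018_1_sec_test_expert_TRAIN.json contain "test" but belong to the training side.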

File Structure

Each dataset is organized within a separate folder, as listed in the Dataset Summary table. Within each dataset folder, you'll find the following subfolders:

audio: Contains all the raw WAV audio recordings in a flat structure.
annotation_sets: Contains JSON files that gather all annotations for the different train-test splits. For example, the rhinolophus_BCT_expert_TRAIN.json annotation set for the rhinolophus_bct dataset would be located at rhinolophus_bct/annotation_sets/rhinolophus_BCT_expert_TRAIN.json.
annotations: (Optional) This folder contains individual JSON files for each recording in the dataset, storing all annotations for the corresponding recording.
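The layout above can be turned into paths with a few small helpers. This is a minimal sketch: the root directory name `data`, the helper names, and the lowercase `.wav` extension are assumptions, so adjust them to match where and how the datasets are stored locally.

```python
from pathlib import Path


def annotation_set_path(root, dataset_path, annotation_set_name):
    """Path to an annotation set JSON inside a dataset folder."""
    return Path(root) / dataset_path / "annotation_sets" / annotation_set_name


def audio_dir(root, dataset_path):
    """Folder holding the raw WAV recordings of a dataset (flat structure)."""
    return Path(root) / dataset_path / "audio"


def list_wav_files(root, dataset_path):
    """Sorted list of WAV files in a dataset's audio folder.

    Assumes a lowercase ".wav" extension; returns an empty list if the
    folder does not exist.
    """
    return sorted(audio_dir(root, dataset_path).glob("*.wav"))


# "data" is a hypothetical root directory containing all dataset folders.
path = annotation_set_path(
    "data", "rhinolophus_bct", "rhinolophus_BCT_expert_TRAIN.json"
)
```

Note that `dataset_path` is the value from the Dataset Path column of the Dataset Summary table, which differs from the dataset name in a few cases (for example, echobank is stored under echobank_batdetect2).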
