mirror of https://github.com/macaodha/batdetect2.git

Updating training preprocess notebook

parent f63307757c
commit d84b7795f6

datasets.md · 110 lines · Normal file

@@ -0,0 +1,110 @@
# Batdetect2 Datasets

This document describes the datasets used to train and evaluate the `batdetect2` model for acoustic bat detection and classification.

`batdetect2` was trained using a combination of datasets, primarily collected and annotated in partnership with the Bat Conservation Trust (BCT).
The data sources include:

- **Dedicated recordings**: The majority of the data comprises focal recordings of individual bats with known species identifications. These were collected specifically for this project in collaboration with the BCT and partners.
- **BatDetective project**: A subset of recordings was selected from the BatDetective project and re-annotated for this project.
- **Queen Elizabeth Olympic Park**: Recordings were sourced from the acoustic monitoring network in East London.
## Annotation Approach

Most datasets contain annotations with species-level identification.
However, the `BatDetective` and `bat_logger*` datasets lack species identification for individual calls, because these recordings were collected passively without on-site species verification.
Annotations for these datasets include bounding boxes around echolocation calls but no species labels.
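For illustration only, an annotation from one of these species-free datasets might look roughly like the sketch below. The field names are hypothetical placeholders rather than the actual `batdetect2` annotation schema; the point is simply that each call gets a time-frequency bounding box and an event label, but no species.

```python
# Hypothetical single-call annotation from a dataset without species IDs.
# Field names are illustrative placeholders, not the real annotation schema.
annotation_without_species = {
    "start_time": 1.234,      # seconds from the start of the recording
    "end_time": 1.241,
    "low_freq": 32000,        # Hz, lower edge of the bounding box
    "high_freq": 78000,       # Hz, upper edge of the bounding box
    "event": "Echolocation",  # the call type is annotated...
    "class": "",              # ...but the species label is left empty
}
```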
## Dataset Summary

The following table provides an overview of each dataset used in training and evaluating `batdetect2`.
It includes the location of the dataset files and whether the annotations include species identification ("Species ID").

| Dataset               | Dataset Path             | Species ID |
| --------------------- | ------------------------ | ---------- |
| BatDetective          | bat_detective_batdetect2 | no         |
| bat_logger_qeop_empty | bat_logger_qeop_empty    | no         |
| bat_logger_2016_empty | bat_logger_2016          | no         |
| echobank              | echobank_batdetect2      | yes        |
| sn_scot_nor           | sn_scot_nor              | yes        |
| bct_1_sec             | bct_1_sec                | yes        |
| bcireland             | bcireland                | yes        |
| rhinolophus_bct       | rhinolophus_bct          | yes        |
| bat_data_2018         | bat_data_2018            | yes        |
| bat_data_2018_test    | bat_data_2018_test       | yes        |
| bat_data_2019         | bat_data_2019            | yes        |
| bat_data_2019_test    | bat_data_2019_test       | yes        |
## Train/Test Splits

To ensure a robust and unbiased evaluation of `batdetect2`, we carefully considered how to split the data into training and test sets.
Due to potential dependencies within datasets (e.g., recordings from the same site, date, or even individual), we implemented two distinct splitting strategies:

1. **Split Diff**: This strategy assigns entire datasets to either the training or test set.
   This maximizes the independence between training and testing data, reducing the chance of the model encountering similar recordings in both sets.

2. **Split Same**: In this approach, each dataset is individually split into training and test subsets.
   This means that recordings in the test set might share similarities with those in the training set (e.g., same site, methodology).
   This split helps assess the model's ability to generalize to new recordings within familiar contexts (a sketch of one way to produce such a split follows this list).
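As a rough illustration of the Split Same idea, the sketch below assigns the recordings of a single dataset to train or test by hashing their file names, which keeps the assignment deterministic across runs. This is only one way such a within-dataset split could be produced; it is not the procedure used to build the annotation sets listed below, and the `<dataset>/audio` layout it assumes is the one described in the File structure section.

```python
import hashlib
from pathlib import Path

def split_same(dataset_dir: str, test_fraction: float = 0.2):
    """Deterministically split one dataset's recordings into train/test lists.

    Illustrative only: hashes each file name and thresholds the hash,
    so roughly `test_fraction` of recordings land in the test set.
    """
    train, test = [], []
    for wav in sorted(Path(dataset_dir, "audio").glob("*.wav")):
        digest = hashlib.md5(wav.name.encode("utf-8")).hexdigest()
        fraction = int(digest[:8], 16) / 16**8  # map the hash to [0, 1)
        (test if fraction < test_fraction else train).append(wav.name)
    return train, test

# e.g. train_files, test_files = split_same("data/echobank_batdetect2")
```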
For each split, the corresponding annotations were organized into JSON files.
The following tables list the annotation sets used for training and testing, along with their associated dataset:
### Split Diff Summary

| Dataset               | Annotation Set Name                           | Is Test |
| --------------------- | --------------------------------------------- | ------- |
| BatDetective          | train_set_bulgaria_batdetective_with_bbs.json | no      |
| bat_logger_qeop_empty | bat_logger_qeop_empty.json                    | no      |
| bat_logger_2016       | train_set_bat_logger_2016_empty.json          | no      |
| echobank_batdetect2   | Echobank_train_expert.json                    | no      |
| sn_scot_nor           | sn_scot_nor_0.5_expert.json                   | no      |
| bct_1_sec             | bct_1_sec_train_expert.json                   | no      |
| bcireland             | bcireland_expert.json                         | no      |
| rhinolophus_bct       | rhinolophus_BCT_expert.json                   | no      |
| bat_data_2018         | BritishBatCalls_2018_1_sec_train_expert.json  | yes     |
| bat_data_2018_test    | BritishBatCalls_2018_1_sec_test_expert.json   | yes     |
| bat_data_2019         | BritishBatCalls_2019_1_sec_test_expert.json   | yes     |
| bat_data_2019_test    | BritishBatCalls_2019_1_sec_test_expert.json   | yes     |
### Split Same Summary

| Dataset               | Annotation Set Name                                | Is Test |
| --------------------- | -------------------------------------------------- | ------- |
| BatDetective          | train_set_bulgaria_batdetective_with_bbs.json      | no      |
| bat_logger_qeop_empty | bat_logger_qeop_empty.json                         | no      |
| bat_logger_2016       | train_set_bat_logger_2016_empty.json               | no      |
| echobank              | Echobank_train_expert_TRAIN.json                   | no      |
| sn_scot_nor           | sn_scot_nor_0.5_expert_TRAIN.json                  | no      |
| bct_1_sec             | BCT_1_sec_train_expert_TRAIN.json                  | no      |
| bcireland             | bcireland_expert_TRAIN.json                        | no      |
| rhinolophus_bct       | rhinolophus_BCT_expert_TRAIN.json                  | no      |
| bat_data_2018         | BritishBatCalls_2018_1_sec_train_expert_TRAIN.json | no      |
| bat_data_2018_test    | BritishBatCalls_2018_1_sec_test_expert_TRAIN.json  | no      |
| bat_data_2019         | BritishBatCalls_2019_1_sec_train_expert_TRAIN.json | no      |
| bat_data_2019_test    | BritishBatCalls_2019_1_sec_test_expert_TRAIN.json  | no      |
| echobank              | Echobank_train_expert_TEST.json                    | yes     |
| sn_scot_nor           | sn_scot_nor_0.5_expert_TEST.json                   | yes     |
| BCT_1_sec             | BCT_1_sec_train_expert_TEST.json                   | yes     |
| bcireland             | bcireland_expert_TEST.json                         | yes     |
| rhinolophus_bct       | rhinolophus_BCT_expert_TEST.json                   | yes     |
| bat_data_2018         | BritishBatCalls_2018_1_sec_train_expert_TEST.json  | yes     |
| bat_data_2018_test    | BritishBatCalls_2018_1_sec_test_expert_TEST.json   | yes     |
| bat_data_2019         | BritishBatCalls_2019_1_sec_train_expert_TEST.json  | yes     |
| bat_data_2019_test    | BritishBatCalls_2019_1_sec_test_expert_TEST.json   | yes     |
## File structure

Each dataset is organized within a separate folder, as listed in the Dataset Summary table.
Within each dataset folder, you'll find the following subfolders (a short loading sketch follows the list):

- `audio`: Contains all the raw WAV audio recordings in a flat structure.
- `annotation_sets`: Contains JSON files that gather all annotations for the different train-test splits.
  For example, the `rhinolophus_BCT_expert_TRAIN.json` annotation set for the `rhinolophus_bct` dataset would be located at `rhinolophus_bct/annotation_sets/rhinolophus_BCT_expert_TRAIN.json`.
- `annotations`: (Optional) This folder contains individual JSON files for each recording in the dataset, storing all annotations for the corresponding recording.
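As a minimal sketch of how these pieces fit together, the snippet below resolves and loads one annotation-set JSON using the layout above. The `data` root directory is an assumed placeholder, and nothing is assumed here about the internal structure of the JSON itself.

```python
import json
from pathlib import Path

# Assumed placeholder for wherever the dataset folders listed above live.
DATA_ROOT = Path("data")

def load_annotation_set(dataset: str, annotation_set: str):
    """Load one annotation-set JSON following the documented folder layout."""
    path = DATA_ROOT / dataset / "annotation_sets" / annotation_set
    with open(path) as fp:
        return json.load(fp)

# Example from the text: the Split Same training annotations for rhinolophus_bct,
# i.e. data/rhinolophus_bct/annotation_sets/rhinolophus_BCT_expert_TRAIN.json
annotations = load_annotation_set("rhinolophus_bct", "rhinolophus_BCT_expert_TRAIN.json")
```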
@@ -179,10 +179,21 @@
 "    ]\n",
 "\n",
 "    # Classify a given sound event annotation\n",
-"    def encode(self, x: data.SoundEventAnnotation) -> Optional[str]:\n",
+"    def encode(self, sound_event_annotation: data.SoundEventAnnotation) -> Optional[str]:\n",
 "\n",
 "        # Extract the \"event\" tag (e.g., \"Echolocation\" or \"Social\")\n",
-"        event_tag = data.find_tag(x.tags, \"event\")\n",
+"        event_tag = data.find_tag(sound_event_annotation.tags, \"event\")\n",
 "\n",
+"        # Extract the \"class\" tag (species) for echolocation calls\n",
+"        species_tag = data.find_tag(sound_event_annotation.tags, \"class\")\n",
+"\n",
+"        if event_tag is None and species_tag is None:\n",
+"            return None\n",
+"\n",
+"        assert species_tag is not None\n",
+"\n",
+"        if event_tag is None:\n",
+"            return species_tag.value\n",
+"\n",
 "        # If it's a social call, return \"social\" as the class\n",
 "        if event_tag.value == \"Social\":\n",
@@ -192,16 +203,14 @@
 "        if event_tag.value != \"Echolocation\":\n",
 "            return None\n",
 "\n",
-"        # Extract the \"class\" tag (species) for echolocation calls\n",
-"        species_tag = data.find_tag(x.tags, \"class\")\n",
 "        return species_tag.value\n",
 "\n",
 "    # Convert a class prediction back into annotation tags\n",
-"    def decode(self, class_name: str) -> List[data.Tag]:\n",
-"        if class_name == \"social\":\n",
+"    def decode(self, label: str) -> List[data.Tag]:\n",
+"        if label == \"social\":\n",
 "            return [data.Tag(key=\"event\", value=\"social\")]\n",
 "\n",
-"        return [data.Tag(key=\"class\", value=class_name)]"
+"        return [data.Tag(key=\"class\", value=label)]"
 ]
 },
 {
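To make this notebook cell easier to follow outside the diff, here is a small self-contained sketch of the same encode/decode idea. The `Tag`, `SoundEventAnnotation`, and `find_tag` definitions below are simplified stand-ins assumed for illustration, not the project's real `data` module, and the control flow paraphrases the cell above while handling missing tags defensively instead of asserting.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified stand-ins for the notebook's `data` module (assumptions, not the real API).
@dataclass
class Tag:
    key: str
    value: str

@dataclass
class SoundEventAnnotation:
    tags: List[Tag] = field(default_factory=list)

def find_tag(tags: List[Tag], key: str) -> Optional[Tag]:
    """Return the first tag with the given key, or None."""
    return next((tag for tag in tags if tag.key == key), None)

def encode(annotation: SoundEventAnnotation) -> Optional[str]:
    """Map an annotation to a training class name: 'social', a species, or None."""
    event_tag = find_tag(annotation.tags, "event")
    species_tag = find_tag(annotation.tags, "class")
    if event_tag is not None and event_tag.value == "Social":
        return "social"
    if event_tag is not None and event_tag.value != "Echolocation":
        return None
    return species_tag.value if species_tag is not None else None

def decode(label: str) -> List[Tag]:
    """Map a predicted class name back to annotation tags."""
    if label == "social":
        return [Tag(key="event", value="social")]
    return [Tag(key="class", value=label)]

# Round trip for an echolocation call with a species label.
call = SoundEventAnnotation(tags=[Tag("event", "Echolocation"), Tag("class", "Myotis daubentonii")])
assert encode(call) == "Myotis daubentonii"
assert decode("Myotis daubentonii") == [Tag(key="class", value="Myotis daubentonii")]
```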
@@ -1,5 +1,5 @@
 [tool]
-rye = { dev-dependencies = [
+uv = { dev-dependencies = [
 "ipykernel>=6.29.4",
 "setuptools>=69.5.1",
 "pytest>=8.1.1",