mirror of https://github.com/macaodha/batdetect2.git (synced 2025-06-29 22:51:58 +02:00)

Commit f314942628 (parent 638f93fe92): Added more docs for preprocessing module

---
# Preprocessing Audio for BatDetect2

## What is Preprocessing?

Preprocessing refers to the steps taken to transform your raw audio recordings into a standardized format suitable for analysis by the BatDetect2 deep learning model. This module (`batdetect2.preprocessing`) provides the tools to perform these transformations.
## Why is Preprocessing Important?

Applying a consistent preprocessing pipeline is important for several reasons:

1. **Standardization:** Audio recordings vary significantly depending on the equipment used, recording conditions, and settings (e.g., different sample rates, varying loudness levels, background noise). Preprocessing helps standardize these aspects, making the data more uniform and allowing the model to learn relevant patterns more effectively.

2. **Model Requirements:** Deep learning models, particularly those like BatDetect2 that analyze 2D patterns in spectrograms, are designed to work with specific input characteristics. They often expect spectrograms of a certain size (time × frequency bins), with values represented on a particular scale (e.g., logarithmic/dB) and within a defined frequency range. Preprocessing ensures the data meets these requirements.

3. **Consistency is Key:** Perhaps most importantly, the **exact same preprocessing steps** must be applied both when _training_ the model and when _using the trained model to make predictions_ (inference) on new data. Any discrepancy between the preprocessing used during training and inference can significantly degrade the model's performance and lead to unreliable results. BatDetect2's configurable pipeline ensures this consistency.
## How Preprocessing is Done in BatDetect2

BatDetect2 handles preprocessing through a configurable, two-stage pipeline:

1. **Audio Loading & Preparation:** This first stage deals with the raw audio waveform. It involves loading the specified audio segment (from a file or clip), selecting a single channel (mono), optionally resampling it to a consistent sample rate, optionally adjusting its duration, and applying basic waveform conditioning like centering (DC offset removal) and amplitude scaling. (Details in the {doc}`audio` section.)

2. **Spectrogram Generation:** The prepared audio waveform is then converted into a spectrogram. This involves calculating the Short-Time Fourier Transform (STFT) and then applying a series of configurable steps like cropping the frequency range, applying amplitude representations (like dB scale or PCEN), optional denoising, optional resizing to the model's required dimensions, and optional final normalization. (Details in the {doc}`spectrogram` section.)

The entire pipeline is controlled via settings in your main configuration file (typically a YAML file), usually grouped under a `preprocessing:` section which contains subsections like `audio:` and `spectrogram:`. This allows you to easily define, share, and reproduce the exact preprocessing used for a specific model or experiment.
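As a minimal sketch of that structure (key names follow the `audio:` and `spectrogram:` options documented in the linked sections; exact defaults may differ in your version), a `preprocessing:` block might look like:

```yaml
# config.yaml (illustrative sketch, not a complete configuration)
preprocessing:
  audio:                 # Stage 1: waveform loading & preparation
    resample:
      samplerate: 256000
  spectrogram:           # Stage 2: spectrogram generation
    frequencies:
      min_freq: 10000
      max_freq: 120000
    scale: dB
```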
## Next Steps

Explore the following sections for detailed explanations of how to configure each stage of the preprocessing pipeline and how to use the resulting preprocessor:

```{toctree}
:maxdepth: 1
:caption: Preprocessing Steps:

audio
spectrogram
usage
```
---

*New file: `docs/source/preprocessing/spectrogram.md` (141 lines)*
# Spectrogram Generation

## Purpose

After loading and performing initial processing on the audio waveform (as described in the Audio Loading section), the next crucial step in the `preprocessing` pipeline is to convert that waveform into a **spectrogram**. A spectrogram is a visual representation of sound, showing frequency content over time, and it's the primary input format for many deep learning models, including BatDetect2.

This module handles the calculation and subsequent processing of the spectrogram. Just like the audio processing, these steps need to be applied **consistently** during both model training and later use (inference) to ensure the model performs reliably. You control this entire process through the configuration file.
## The Spectrogram Generation Pipeline

Once BatDetect2 has a prepared audio waveform, it follows these steps to create the final spectrogram input for the model:

1. **Calculate STFT (Short-Time Fourier Transform):** This is the fundamental step that converts the 1D audio waveform into a 2D time-frequency representation. It calculates the frequency content within short, overlapping time windows. The output is typically a **magnitude spectrogram**, showing the intensity (amplitude) of different frequencies at different times. The key parameters here are `window_duration` and `window_overlap`, which control the trade-off between time and frequency resolution.

2. **Crop Frequencies:** The STFT often produces frequency information over a very wide range (e.g., 0 Hz up to half the sample rate). This step crops the spectrogram to focus only on the frequency range relevant to your target sounds (e.g., 10 kHz to 120 kHz for typical bat echolocation).

3. **Apply PCEN (Optional):** If configured, Per-Channel Energy Normalization is applied. PCEN is an adaptive technique that adjusts the gain (loudness) in each frequency channel based on its recent history. It can help suppress stationary background noise and enhance the prominence of transient sounds like echolocation pulses.

4. **Set Amplitude Scale / Representation:** The values in the spectrogram (either raw magnitudes or post-PCEN values) need to be represented on a suitable scale. You choose one of the following:

   - `"amplitude"`: Use the linear magnitude values directly (default).
   - `"power"`: Use the squared magnitude values (representing energy).
   - `"dB"`: Apply a logarithmic transformation (specifically `log(1 + C*Magnitude)`). This compresses the range of values, often making variations in quieter sounds more apparent, similar to how humans perceive loudness.

5. **Denoise (Optional):** If configured (usually **on** by default), a simple noise reduction technique is applied. This method subtracts the average value of each frequency bin (calculated across time) from that bin, assuming the average represents steady background noise. Negative values after subtraction are clipped to zero.

6. **Resize (Optional):** If configured, the dimensions (height/frequency bins and width/time bins) of the spectrogram are adjusted using interpolation to match the exact input size expected by the neural network architecture.

7. **Peak Normalize (Optional):** If configured (typically **off** by default), the entire final spectrogram is scaled so that its highest value is exactly 1.0. This ensures all spectrograms fed to the model have a consistent maximum value, which can sometimes aid training stability.
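To make the sequence concrete, here is a minimal NumPy sketch of steps 1, 2, 4, and 5 (STFT, frequency cropping, the `log(1 + C*Magnitude)` transform, and spectral mean subtraction). This is an illustration only, not the BatDetect2 implementation: the function name and parameter defaults are hypothetical, and PCEN, resizing, and peak normalization are omitted for brevity.

```python
import numpy as np

def sketch_spectrogram(waveform, samplerate=256_000, window_duration=0.002,
                       window_overlap=0.75, min_freq=10_000, max_freq=120_000,
                       scale="dB", denoise=True):
    """Illustrative pipeline sketch (not the actual BatDetect2 code)."""
    n_fft = int(window_duration * samplerate)    # samples per analysis window
    hop = int(n_fft * (1 - window_overlap))      # step between windows
    window = np.hanning(n_fft)

    # Step 1: STFT magnitude -> (frequency bins, time bins)
    frames = [waveform[i:i + n_fft] * window
              for i in range(0, len(waveform) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

    # Step 2: crop to the frequency band of interest
    freqs = np.fft.rfftfreq(n_fft, d=1 / samplerate)
    spec = spec[(freqs >= min_freq) & (freqs <= max_freq)]

    # Step 4: amplitude representation
    if scale == "power":
        spec = spec ** 2
    elif scale == "dB":
        spec = np.log1p(spec)  # log(1 + C*Magnitude), with C = 1 here

    # Step 5: spectral mean subtraction, clipping negatives to zero
    if denoise:
        spec = np.clip(spec - spec.mean(axis=1, keepdims=True), 0, None)
    return spec
```

With a 256 kHz sample rate and a 2 ms window, `n_fft` is 512 samples and the hop is 128 samples, which matches the resolution trade-off described in step 1.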
## Configuring Spectrogram Generation

You control all these steps via settings in your main configuration file (e.g., `config.yaml`), within the `spectrogram:` section (usually located under the main `preprocessing:` section).

Here are the key configuration options:
- **STFT Settings (`stft`)**:

  - `window_duration`: (Number, seconds, e.g., `0.002`) Length of the analysis window.
  - `window_overlap`: (Number, 0.0 to <1.0, e.g., `0.75`) Fractional overlap between windows.
  - `window_fn`: (Text, e.g., `"hann"`) Name of the windowing function.

- **Frequency Cropping (`frequencies`)**:

  - `min_freq`: (Integer, Hz, e.g., `10000`) Minimum frequency to keep.
  - `max_freq`: (Integer, Hz, e.g., `120000`) Maximum frequency to keep.

- **PCEN (`pcen`)**:

  - This entire section is **optional**. Include it only if you want to apply PCEN; if it is omitted or set to `null`, PCEN is skipped.
  - `time_constant`: (Number, seconds, e.g., `0.4`) Controls adaptation speed.
  - `gain`: (Number, e.g., `0.98`) Gain factor.
  - `bias`: (Number, e.g., `2.0`) Bias factor.
  - `power`: (Number, e.g., `0.5`) Compression exponent.

- **Amplitude Scale (`scale`)**:

  - (Text: `"dB"`, `"power"`, or `"amplitude"`) Selects the final representation of the spectrogram values. Default is `"amplitude"`.

- **Denoising (`spectral_mean_substraction`)**:

  - (Boolean: `true` or `false`) Enables/disables the spectral mean subtraction denoising step. Default is usually `true`.

- **Resizing (`size`)**:

  - This entire section is **optional**. Include it only if you need to resize the spectrogram to specific dimensions required by the model; if it is omitted or set to `null`, no resizing occurs after frequency cropping.
  - `height`: (Integer, e.g., `128`) Target number of frequency bins.
  - `resize_factor`: (Number or `null`, e.g., `0.5`) Factor to scale the time dimension by. `0.5` halves the width; `null` or `1.0` keeps the original width.

- **Peak Normalization (`peak_normalize`)**:

  - (Boolean: `true` or `false`) Enables/disables final scaling of the entire spectrogram so the maximum value is 1.0. Default is usually `false`.
**Example YAML Configuration:**

```yaml
# Inside your main configuration file

preprocessing:
  audio:
    # ... (your audio configuration settings) ...
    resample:
      samplerate: 256000 # Ensure this matches model needs

  spectrogram:
    # --- STFT Parameters ---
    stft:
      window_duration: 0.002 # 2ms window
      window_overlap: 0.75 # 75% overlap
      window_fn: hann

    # --- Frequency Range ---
    frequencies:
      min_freq: 10000 # 10 kHz
      max_freq: 120000 # 120 kHz

    # --- PCEN (Optional) ---
    # Include this block to enable PCEN, omit or set to null to disable.
    pcen:
      time_constant: 0.4
      gain: 0.98
      bias: 2.0
      power: 0.5

    # --- Final Amplitude Representation ---
    scale: dB # Choose 'dB', 'power', or 'amplitude'

    # --- Denoising ---
    spectral_mean_substraction: true # Enable spectral mean subtraction

    # --- Resizing (Optional) ---
    # Include this block to resize, omit or set to null to disable.
    size:
      height: 128 # Target height in frequency bins
      resize_factor: 0.5 # Halve the number of time bins

    # --- Final Normalization ---
    peak_normalize: false # Do not scale max value to 1.0
```
## Outcome

The output of this module is the final, processed spectrogram (a 2D numerical array with time and frequency information). This spectrogram is now in the precise format expected by the BatDetect2 neural network, ready to be used for training the model or for making predictions on new data.

Remember: using the exact same `spectrogram` configuration settings during training and inference is essential for correct model performance.
---

*New file: `docs/source/preprocessing/usage.md` (175 lines)*
# Using Preprocessors in BatDetect2

## Overview

In the previous sections ({doc}`audio` and {doc}`spectrogram`), we discussed the individual steps involved in converting raw audio into a processed spectrogram suitable for BatDetect2 models, and how to configure these steps using YAML files (specifically the `audio:` and `spectrogram:` sections within a main `preprocessing:` configuration block).

This page focuses on how this configured pipeline is represented and used within BatDetect2, primarily through the concept of a **`Preprocessor`** object. This object bundles together your chosen audio loading settings and spectrogram generation settings into a single component that can perform the end-to-end processing.
## Do I Need to Interact with Preprocessors Directly?

**Usually, no.** For standard model training or running inference with BatDetect2 using the provided scripts, the system will automatically:

1. Read your main configuration file (e.g., `config.yaml`).
2. Find the `preprocessing:` section (containing `audio:` and `spectrogram:` settings).
3. Build the appropriate `Preprocessor` object internally based on your settings.
4. Use that internal `Preprocessor` object automatically whenever audio needs to be loaded and converted to a spectrogram.

**However**, understanding the `Preprocessor` object is useful if you want to:

- **Customize:** Go beyond the standard configuration options by interacting with parts of the pipeline programmatically.
- **Integrate:** Use BatDetect2's preprocessing steps within your own custom Python analysis scripts.
- **Inspect/Debug:** Manually apply preprocessing steps to specific files or clips to examine intermediate outputs (like the processed waveform) or the final spectrogram.
## Getting a Preprocessor Object

If you _do_ want to work with the preprocessor programmatically, you first need to get an instance of it. This is typically done based on a configuration:

1. **Define Configuration:** Create your `preprocessing:` configuration, usually in a YAML file (let's call it `preprocess_config.yaml`), detailing your desired `audio` and `spectrogram` settings.

   ```yaml
   # preprocess_config.yaml
   audio:
     resample:
       samplerate: 256000
     # ... other audio settings ...
   spectrogram:
     frequencies:
       min_freq: 15000
       max_freq: 120000
     scale: dB
     # ... other spectrogram settings ...
   ```
2. **Load Configuration & Build Preprocessor (in Python):**

   ```python
   from batdetect2.preprocessing import load_preprocessing_config, build_preprocessor
   from batdetect2.preprocess.types import Preprocessor  # Import the type

   # Load the configuration from the file
   config_path = "path/to/your/preprocess_config.yaml"
   preprocessing_config = load_preprocessing_config(config_path)

   # Build the actual preprocessor object using the loaded config
   preprocessor: Preprocessor = build_preprocessor(preprocessing_config)

   # 'preprocessor' is now ready to use!
   ```
3. **Using Defaults:** If you just want the standard BatDetect2 default preprocessing settings, you can build one without loading a config file:

   ```python
   from batdetect2.preprocessing import build_preprocessor
   from batdetect2.preprocess.types import Preprocessor

   # Build with default settings
   default_preprocessor: Preprocessor = build_preprocessor()
   ```
## Applying Preprocessing

Once you have a `preprocessor` object, you can use its methods to process audio data:

**1. End-to-End Processing (Common Use Case):**

These methods take an audio source identifier (file path, Recording object, or Clip object) and return the final, processed spectrogram.

- `preprocessor.preprocess_file(path)`: Processes an entire audio file.
- `preprocessor.preprocess_recording(recording_obj)`: Processes the entire audio associated with a `soundevent.data.Recording` object.
- `preprocessor.preprocess_clip(clip_obj)`: Processes only the specific time segment defined by a `soundevent.data.Clip` object.
- **Efficiency Note:** Using `preprocess_clip` is **highly recommended** when you are only interested in analyzing a small portion of a potentially long recording. It avoids loading the entire audio file into memory, making it much more efficient.

```python
from soundevent import data

# Assume 'preprocessor' is built as shown before
# Assume 'my_clip' is a soundevent.data.Clip object defining a segment

# Process an entire file
spectrogram_from_file = preprocessor.preprocess_file("my_recording.wav")

# Process only a specific clip (more efficient for segments)
spectrogram_from_clip = preprocessor.preprocess_clip(my_clip)

# The results (spectrogram_from_file, spectrogram_from_clip) are xr.DataArrays
print(type(spectrogram_from_clip))
# Output: <class 'xarray.core.dataarray.DataArray'>
```
**2. Intermediate Steps (Advanced Use Cases):**

The preprocessor also allows access to intermediate stages if needed:

- `preprocessor.load_clip_audio(clip_obj)` (and similar for file/recording): Loads the audio and applies _only_ the waveform processing steps (resampling, centering, etc.) defined in the `audio` config. Returns the processed waveform as an `xr.DataArray`. This is useful if you want to analyze or manipulate the waveform itself before spectrogram generation.
- `preprocessor.compute_spectrogram(waveform)`: Takes an _already loaded_ waveform (either `np.ndarray` or `xr.DataArray`) and applies _only_ the spectrogram generation steps defined in the `spectrogram` config.
  - If you provide an `xr.DataArray` (e.g., from `load_clip_audio`), it uses the sample rate from the array's coordinates.
  - If you provide a raw `np.ndarray`, it **must assume a sample rate**. It uses the `default_samplerate` that was determined when the `preprocessor` was built (based on your `audio` config's resample settings or the global default). Be cautious when using NumPy arrays: make sure this sample rate assumption is correct for your data!

```python
# Example: Get waveform first, then spectrogram
waveform = preprocessor.load_clip_audio(my_clip)
# waveform is an xr.DataArray

# ...potentially do other things with the waveform...

# Compute spectrogram from the loaded waveform
spectrogram = preprocessor.compute_spectrogram(waveform)

# Example: Process an external numpy array (use with caution re: sample rate)
# import soundfile as sf  # Requires installing soundfile
# numpy_waveform, original_sr = sf.read("some_other_audio.wav")
# # MUST ensure numpy_waveform's actual sample rate matches
# # preprocessor.default_samplerate for correct results here!
# spec_from_numpy = preprocessor.compute_spectrogram(numpy_waveform)
```
## Understanding the Output: `xarray.DataArray`

All preprocessing methods return the final spectrogram (or the intermediate waveform) as an **`xarray.DataArray`**.

**What is it?** Think of it like a standard NumPy array (holding the numerical data of the spectrogram) but with added "superpowers":

- **Labeled Dimensions:** Instead of just having axis 0 and axis 1, the dimensions have names, typically `"frequency"` and `"time"`.
- **Coordinates:** It stores the actual frequency values (e.g., in Hz) corresponding to each row and the actual time values (e.g., in seconds) corresponding to each column along the dimensions.

**Why is it used?**

- **Clarity:** The data is self-describing. You don't need to remember which axis is time and which is frequency, or what the units are – it's stored with the data.
- **Convenience:** You can select, slice, or plot data using the real-world coordinate values (times, frequencies) instead of just numerical indices. This makes analysis code easier to write and less prone to errors.
- **Metadata:** It can hold additional metadata about the processing steps in its `attrs` (attributes) dictionary.
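As a quick, self-contained illustration of these "superpowers" (using toy data, not actual BatDetect2 output), here is how labeled dimensions and coordinates behave:

```python
import numpy as np
import xarray as xr

# A toy "spectrogram": 4 frequency bins x 5 time bins with labeled coordinates.
spec = xr.DataArray(
    np.arange(20, dtype=float).reshape(4, 5),
    dims=("frequency", "time"),
    coords={
        "frequency": [20_000, 40_000, 60_000, 80_000],  # Hz
        "time": [0.0, 0.1, 0.2, 0.3, 0.4],              # seconds
    },
)

# Select by real-world values instead of numerical indices:
value = spec.sel(frequency=60_000, time=0.2)                       # exact labels
nearest = spec.sel(frequency=55_000, time=0.19, method="nearest")  # closest bin
window = spec.sel(time=slice(0.1, 0.3))                            # label-based slice
```

Note that `sel` with a slice is inclusive of both endpoints, and `method="nearest"` snaps to the closest coordinate value, which is convenient when a query time or frequency does not fall exactly on a bin.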
**Using the Output:**

- **Input to Model:** For standard training or inference, you typically pass this `xr.DataArray` spectrogram directly to the BatDetect2 model functions.
- **Inspection/Analysis:** If you're working programmatically, you can use xarray's powerful features. For example (these are just illustrations of xarray):

```python
# Get the shape (frequency_bins, time_bins)
# print(spectrogram.shape)

# Get the frequency coordinate values
# print(spectrogram['frequency'].values)

# Select data near a specific time and frequency
# value_at_point = spectrogram.sel(time=0.5, frequency=50000, method="nearest")
# print(value_at_point)

# Select a time slice between 0.2 and 0.3 seconds
# time_slice = spectrogram.sel(time=slice(0.2, 0.3))
# print(time_slice.shape)
```

In summary, while BatDetect2 often handles preprocessing automatically based on your configuration, the underlying `Preprocessor` object provides a flexible interface for applying these steps programmatically if needed, returning results in the convenient and informative `xarray.DataArray` format.