Added more docs for preprocessing module

mbsantiago 2025-04-17 16:28:48 +01:00
parent 638f93fe92
commit f314942628
3 changed files with 356 additions and 2 deletions


@@ -1,8 +1,46 @@
# Preprocessing Audio for BatDetect2
## What is Preprocessing?
Preprocessing refers to the steps taken to transform your raw audio recordings into a standardized format suitable for analysis by the BatDetect2 deep learning model.
This module (`batdetect2.preprocessing`) provides the tools to perform these transformations.
## Why is Preprocessing Important?
Applying a consistent preprocessing pipeline is important for several reasons:
1. **Standardization:** Audio recordings vary significantly depending on the equipment used, recording conditions, and settings (e.g., different sample rates, varying loudness levels, background noise).
Preprocessing helps standardize these aspects, making the data more uniform and allowing the model to learn relevant patterns more effectively.
2. **Model Requirements:** Deep learning models, particularly those like BatDetect2 that analyze 2D patterns in spectrograms, are designed to work with specific input characteristics.
They often expect spectrograms of a certain size (time x frequency bins), with values represented on a particular scale (e.g., logarithmic/dB), and within a defined frequency range.
Preprocessing ensures the data meets these requirements.
3. **Consistency is Key:** Perhaps most importantly, the **exact same preprocessing steps** must be applied both when _training_ the model and when _using the trained model to make predictions_ (inference) on new data.
Any discrepancy between the preprocessing used during training and inference can significantly degrade the model's performance and lead to unreliable results.
BatDetect2's configurable pipeline ensures this consistency.
## How Preprocessing is Done in BatDetect2
BatDetect2 handles preprocessing through a configurable, two-stage pipeline:
1. **Audio Loading & Preparation:** This first stage deals with the raw audio waveform.
It involves loading the specified audio segment (from a file or clip), selecting a single channel (mono), optionally resampling it to a consistent sample rate, optionally adjusting its duration, and applying basic waveform conditioning like centering (DC offset removal) and amplitude scaling.
(Details in the {doc}`audio` section).
2. **Spectrogram Generation:** The prepared audio waveform is then converted into a spectrogram.
This involves calculating the Short-Time Fourier Transform (STFT) and then applying a series of configurable steps like cropping the frequency range, applying amplitude representations (like dB scale or PCEN), optional denoising, optional resizing to the model's required dimensions, and optional final normalization.
(Details in the {doc}`spectrogram` section).
The entire pipeline is controlled via settings in your main configuration file (typically a YAML file), usually grouped under a `preprocessing:` section which contains subsections like `audio:` and `spectrogram:`.
This allows you to easily define, share, and reproduce the exact preprocessing used for a specific model or experiment.
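For orientation, the two stages map directly onto the programmatic interface described in the {doc}`usage` section.
The following sketch uses the helpers documented there (`my_clip` is assumed to be a `soundevent.data.Clip` you already have):

```python
from batdetect2.preprocessing import build_preprocessor

# Build a preprocessor with the default settings (a custom
# configuration loaded from YAML can be passed instead).
preprocessor = build_preprocessor()

# Stage 1: load the clip's audio and apply the waveform steps.
waveform = preprocessor.load_clip_audio(my_clip)

# Stage 2: convert the prepared waveform into a spectrogram.
spectrogram = preprocessor.compute_spectrogram(waveform)
```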
## Next Steps
Explore the following sections for detailed explanations of how to configure each stage of the preprocessing pipeline and how to use the resulting preprocessor:
```{toctree}
:maxdepth: 1
:caption: Preprocessing Steps:
audio
spectrogram
usage
```


@@ -0,0 +1,141 @@
# Spectrogram Generation
## Purpose
After loading and performing initial processing on the audio waveform (as described in the Audio Loading section), the next crucial step in the `preprocessing` pipeline is to convert that waveform into a **spectrogram**.
A spectrogram is a visual representation of sound, showing frequency content over time, and it's the primary input format for many deep learning models, including BatDetect2.
This module handles the calculation and subsequent processing of the spectrogram.
Just like the audio processing, these steps need to be applied **consistently** during both model training and later use (inference) to ensure the model performs reliably.
You control this entire process through the configuration file.
## The Spectrogram Generation Pipeline
Once BatDetect2 has a prepared audio waveform, it follows these steps to create the final spectrogram input for the model:
1. **Calculate STFT (Short-Time Fourier Transform):** This is the fundamental step that converts the 1D audio waveform into a 2D time-frequency representation.
It calculates the frequency content within short, overlapping time windows.
The output is typically a **magnitude spectrogram**, showing the intensity (amplitude) of different frequencies at different times.
Key parameters here are the `window_duration` and `window_overlap`, which set the trade-off between time and frequency resolution; for example, a 2 ms window yields a frequency resolution of about 500 Hz (1 / 0.002 s), while a longer window resolves frequency more finely at the cost of temporal detail.
2. **Crop Frequencies:** The STFT often produces frequency information over a very wide range (e.g., 0 Hz up to half the sample rate).
This step crops the spectrogram to focus only on the frequency range relevant to your target sounds (e.g., 10 kHz to 120 kHz for typical bat echolocation).
3. **Apply PCEN (Optional):** If configured, Per-Channel Energy Normalization is applied.
PCEN is an adaptive technique that adjusts the gain (loudness) in each frequency channel based on its recent history.
It can help suppress stationary background noise and enhance the prominence of transient sounds like echolocation pulses.
This step is optional.
4. **Set Amplitude Scale / Representation:** The values in the spectrogram (either raw magnitude or post-PCEN values) need to be represented on a suitable scale.
You choose one of the following:
- `"amplitude"`: Use the linear magnitude values directly.
(Default)
- `"power"`: Use the squared magnitude values (representing energy).
- `"dB"`: Apply a logarithmic transformation (specifically `log(1 + C*Magnitude)`).
This compresses the range of values, often making variations in quieter sounds more apparent, similar to how humans perceive loudness.
5. **Denoise (Optional):** If configured (and usually **on** by default), a simple noise reduction technique is applied.
This method subtracts the average value of each frequency bin (calculated across time) from that bin, assuming the average represents steady background noise.
Negative values after subtraction are clipped to zero (a sketch of this and several other steps follows this list).
6. **Resize (Optional):** If configured, the dimensions (height/frequency bins and width/time bins) of the spectrogram are adjusted using interpolation to match the exact input size expected by the neural network architecture.
7. **Peak Normalize (Optional):** If configured (typically **off** by default), the entire final spectrogram is scaled so that its highest value is exactly 1.0.
This ensures all spectrograms fed to the model have a consistent maximum value, which can sometimes aid training stability.
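Here is that sketch: a minimal, self-contained NumPy/SciPy illustration of steps 1, 2, 4, 5, and 7 (PCEN and resizing are omitted).
It is not BatDetect2's internal implementation, and the compression constant `C` and the random stand-in signal are illustrative placeholders only:

```python
import numpy as np
from scipy.signal import stft

samplerate = 256_000
waveform = np.random.randn(samplerate // 2)  # 0.5 s of noise as a stand-in

# Step 1: STFT magnitude (2 ms Hann window, 75% overlap).
nperseg = int(0.002 * samplerate)
freqs, times, sxx = stft(
    waveform,
    fs=samplerate,
    window="hann",
    nperseg=nperseg,
    noverlap=int(0.75 * nperseg),
)
spec = np.abs(sxx)  # magnitude spectrogram

# Step 2: crop to the 10-120 kHz band.
band = (freqs >= 10_000) & (freqs <= 120_000)
spec = spec[band, :]

# (Step 3, PCEN, would go here, e.g. via librosa.pcen.)

# Step 4: "dB"-style scale, log(1 + C * magnitude).
C = 200.0  # arbitrary example constant
spec = np.log1p(C * spec)

# Step 5: spectral mean subtraction, clipping negatives to zero.
spec = np.clip(spec - spec.mean(axis=1, keepdims=True), 0, None)

# Step 7: peak normalization so the maximum value is 1.0.
spec = spec / max(spec.max(), 1e-8)
```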
## Configuring Spectrogram Generation
You control all these steps via settings in your main configuration file (e.g., `config.yaml`), within the `spectrogram:` section (usually located under the main `preprocessing:` section).
Here are the key configuration options:
- **STFT Settings (`stft`)**:
- `window_duration`: (Number, seconds, e.g., `0.002`) Length of the analysis window.
- `window_overlap`: (Number, 0.0 to <1.0, e.g., `0.75`) Fractional overlap between windows.
- `window_fn`: (Text, e.g., `"hann"`) Name of the windowing function.
- **Frequency Cropping (`frequencies`)**:
- `min_freq`: (Integer, Hz, e.g., `10000`) Minimum frequency to keep.
- `max_freq`: (Integer, Hz, e.g., `120000`) Maximum frequency to keep.
- **PCEN (`pcen`)**:
- This entire section is **optional**.
Include it only if you want to apply PCEN.
If omitted or set to `null`, PCEN is skipped.
- `time_constant`: (Number, seconds, e.g., `0.4`) Controls adaptation speed.
- `gain`: (Number, e.g., `0.98`) Gain factor.
- `bias`: (Number, e.g., `2.0`) Bias factor.
- `power`: (Number, e.g., `0.5`) Compression exponent.
- **Amplitude Scale (`scale`)**:
- (Text: `"dB"`, `"power"`, or `"amplitude"`) Selects the final representation of the spectrogram values.
Default is `"amplitude"`.
- **Denoising (`spectral_mean_substraction`)**:
- (Boolean: `true` or `false`) Enables/disables the spectral mean subtraction denoising step.
Default is usually `true`.
- **Resizing (`size`)**:
- This entire section is **optional**.
Include it only if you need to resize the spectrogram to specific dimensions required by the model.
If omitted or set to `null`, no resizing occurs after frequency cropping.
- `height`: (Integer, e.g., `128`) Target number of frequency bins.
- `resize_factor`: (Number or `null`, e.g., `0.5`) Factor to scale the time dimension by.
`0.5` halves the width, `null` or `1.0` keeps the original width.
- **Peak Normalization (`peak_normalize`)**:
- (Boolean: `true` or `false`) Enables/disables final scaling of the entire spectrogram so the maximum value is 1.0.
Default is usually `false`.
**Example YAML Configuration:**
```yaml
# Inside your main configuration file
preprocessing:
  audio:
    # ... (your audio configuration settings) ...
    resample:
      samplerate: 256000 # Ensure this matches model needs

  spectrogram:
    # --- STFT Parameters ---
    stft:
      window_duration: 0.002 # 2ms window
      window_overlap: 0.75 # 75% overlap
      window_fn: hann

    # --- Frequency Range ---
    frequencies:
      min_freq: 10000 # 10 kHz
      max_freq: 120000 # 120 kHz

    # --- PCEN (Optional) ---
    # Include this block to enable PCEN, omit or set to null to disable.
    pcen:
      time_constant: 0.4
      gain: 0.98
      bias: 2.0
      power: 0.5

    # --- Final Amplitude Representation ---
    scale: dB # Choose 'dB', 'power', or 'amplitude'

    # --- Denoising ---
    spectral_mean_substraction: true # Enable spectral mean subtraction

    # --- Resizing (Optional) ---
    # Include this block to resize, omit or set to null to disable.
    size:
      height: 128 # Target height in frequency bins
      resize_factor: 0.5 # Halve the number of time bins

    # --- Final Normalization ---
    peak_normalize: false # Do not scale max value to 1.0
```
## Outcome
The output of this module is the final, processed spectrogram (as a 2D numerical array with time and frequency information).
This spectrogram is now in the precise format expected by the BatDetect2 neural network, ready to be used for training the model or for making predictions on new data.
Remember, using the exact same `spectrogram` configuration settings during training and inference is essential for correct model performance.


@@ -0,0 +1,175 @@
# Using Preprocessors in BatDetect2
## Overview
In the previous sections ({doc}`audio` and {doc}`spectrogram`), we discussed the individual steps involved in converting raw audio into a processed spectrogram suitable for BatDetect2 models, and how to configure these steps using YAML files (specifically the `audio:` and `spectrogram:` sections within a main `preprocessing:` configuration block).
This page focuses on how this configured pipeline is represented and used within BatDetect2, primarily through the concept of a **`Preprocessor`** object.
This object bundles together your chosen audio loading settings and spectrogram generation settings into a single component that can perform the end-to-end processing.
## Do I Need to Interact with Preprocessors Directly?
**Usually, no.** For standard model training or running inference with BatDetect2 using the provided scripts, the system will automatically:
1. Read your main configuration file (e.g., `config.yaml`).
2. Find the `preprocessing:` section (containing `audio:` and `spectrogram:` settings).
3. Build the appropriate `Preprocessor` object internally based on your settings.
4. Use that internal `Preprocessor` object automatically whenever audio needs to be loaded and converted to a spectrogram.
**However**, understanding the `Preprocessor` object is useful if you want to:
- **Customize:** Go beyond the standard configuration options by interacting with parts of the pipeline programmatically.
- **Integrate:** Use BatDetect2's preprocessing steps within your own custom Python analysis scripts.
- **Inspect/Debug:** Manually apply preprocessing steps to specific files or clips to examine intermediate outputs (like the processed waveform) or the final spectrogram.
## Getting a Preprocessor Object
If you _do_ want to work with the preprocessor programmatically, you first need to get an instance of it.
This is typically done based on a configuration:
1. **Define Configuration:** Create your `preprocessing:` configuration, usually in a YAML file (let's call it `preprocess_config.yaml`), detailing your desired `audio` and `spectrogram` settings.
```yaml
# preprocess_config.yaml
audio:
  resample:
    samplerate: 256000
  # ... other audio settings ...
spectrogram:
  frequencies:
    min_freq: 15000
    max_freq: 120000
  scale: dB
  # ... other spectrogram settings ...
```
2. **Load Configuration & Build Preprocessor (in Python):**
```python
from batdetect2.preprocessing import load_preprocessing_config, build_preprocessor
from batdetect2.preprocess.types import Preprocessor # Import the type
# Load the configuration from the file
config_path = "path/to/your/preprocess_config.yaml"
preprocessing_config = load_preprocessing_config(config_path)
# Build the actual preprocessor object using the loaded config
preprocessor: Preprocessor = build_preprocessor(preprocessing_config)
# 'preprocessor' is now ready to use!
```
3. **Using Defaults:** If you just want the standard BatDetect2 default preprocessing settings, you can build one without loading a config file:
```python
from batdetect2.preprocessing import build_preprocessor
from batdetect2.preprocess.types import Preprocessor
# Build with default settings
default_preprocessor: Preprocessor = build_preprocessor()
```
## Applying Preprocessing
Once you have a `preprocessor` object, you can use its methods to process audio data:
**1. End-to-End Processing (Common Use Case):**
These methods take an audio source identifier (file path, Recording object, or Clip object) and return the final, processed spectrogram.
- `preprocessor.preprocess_file(path)`: Processes an entire audio file.
- `preprocessor.preprocess_recording(recording_obj)`: Processes the entire audio associated with a `soundevent.data.Recording` object.
- `preprocessor.preprocess_clip(clip_obj)`: Processes only the specific time segment defined by a `soundevent.data.Clip` object.
- **Efficiency Note:** Using `preprocess_clip` is **highly recommended** when you are only interested in analyzing a small portion of a potentially long recording.
It avoids loading the entire audio file into memory, making it much more efficient.
```python
from soundevent import data
# Assume 'preprocessor' is built as shown before
# Assume 'my_clip' is a soundevent.data.Clip object defining a segment
# Process an entire file
spectrogram_from_file = preprocessor.preprocess_file("my_recording.wav")
# Process only a specific clip (more efficient for segments)
spectrogram_from_clip = preprocessor.preprocess_clip(my_clip)
# The results (spectrogram_from_file, spectrogram_from_clip) are xr.DataArrays
print(type(spectrogram_from_clip))
# Output: <class 'xarray.core.dataarray.DataArray'>
```
**2. Intermediate Steps (Advanced Use Cases):**
The preprocessor also allows access to intermediate stages if needed:
- `preprocessor.load_clip_audio(clip_obj)` (and similar for file/recording): Loads the audio and applies _only_ the waveform processing steps (resampling, centering, etc.) defined in the `audio` config.
Returns the processed waveform as an `xr.DataArray`.
This is useful if you want to analyze or manipulate the waveform itself before spectrogram generation.
- `preprocessor.compute_spectrogram(waveform)`: Takes an _already loaded_ waveform (either `np.ndarray` or `xr.DataArray`) and applies _only_ the spectrogram generation steps defined in the `spectrogram` config.
- If you provide an `xr.DataArray` (e.g., from `load_clip_audio`), it uses the sample rate from the array's coordinates.
- If you provide a raw `np.ndarray`, it **must assume a sample rate**.
It uses the `default_samplerate` that was determined when the `preprocessor` was built (based on your `audio` config's resample settings or the global default).
Be cautious when using NumPy arrays to ensure the sample rate assumption is correct for your data!
```python
# Example: Get waveform first, then spectrogram
waveform = preprocessor.load_clip_audio(my_clip)
# waveform is an xr.DataArray
# ...potentially do other things with the waveform...
# Compute spectrogram from the loaded waveform
spectrogram = preprocessor.compute_spectrogram(waveform)
# Example: Process external numpy array (use with caution re: sample rate)
# import soundfile as sf # Requires installing soundfile
# numpy_waveform, original_sr = sf.read("some_other_audio.wav")
# # MUST ensure numpy_waveform's actual sample rate matches
# # preprocessor.default_samplerate for correct results here!
# spec_from_numpy = preprocessor.compute_spectrogram(numpy_waveform)
```
## Understanding the Output: `xarray.DataArray`
All preprocessing methods return the final spectrogram (or the intermediate waveform) as an **`xarray.DataArray`**.
**What is it?** Think of it like a standard NumPy array (holding the numerical data of the spectrogram) but with added "superpowers":
- **Labeled Dimensions:** Instead of just having axis 0 and axis 1, the dimensions have names, typically `"frequency"` and `"time"`.
- **Coordinates:** It stores the actual frequency values (e.g., in Hz) corresponding to each row and the actual time values (e.g., in seconds) corresponding to each column along the dimensions.
**Why is it used?**
- **Clarity:** The data is self-describing.
You don't need to remember which axis is time and which is frequency, or what the units are; that information is stored with the data.
- **Convenience:** You can select, slice, or plot data using the real-world coordinate values (times, frequencies) instead of just numerical indices.
This makes analysis code easier to write and less prone to errors.
- **Metadata:** It can hold additional metadata about the processing steps in its `attrs` (attributes) dictionary.
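For example, because the coordinates travel with the data, you can plot a spectrogram with correctly labeled time and frequency axes in a single call (this assumes `matplotlib` is installed and `spectrogram` comes from one of the earlier examples):

```python
import matplotlib.pyplot as plt

# xarray uses the "time" and "frequency" coordinates to label the axes.
spectrogram.plot()
plt.show()
```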
**Using the Output:**
- **Input to Model:** For standard training or inference, you typically pass this `xr.DataArray` spectrogram directly to the BatDetect2 model functions.
- **Inspection/Analysis:** If you're working programmatically, you can use xarray's powerful features.
For example (these are standard xarray operations, assuming `spectrogram` was produced as shown above):
```python
# Get the shape (frequency_bins, time_bins)
print(spectrogram.shape)

# Get the frequency coordinate values
print(spectrogram['frequency'].values)

# Select the value nearest a specific time and frequency
value_at_point = spectrogram.sel(time=0.5, frequency=50000, method="nearest")
print(value_at_point)

# Select a time slice between 0.2 and 0.3 seconds
time_slice = spectrogram.sel(time=slice(0.2, 0.3))
print(time_slice.shape)
```
In summary, while BatDetect2 often handles preprocessing automatically based on your configuration, the underlying `Preprocessor` object provides a flexible interface for applying these steps programmatically if needed, returning results in the convenient and informative `xarray.DataArray` format.