Use of AI

In the field of bioacoustics, it is common to collect vast amounts of acoustic recordings over extended periods. Manual analysis of these data is time-consuming and labor-intensive, whereas automatic models enable efficient processing of large datasets, reducing the time and effort required for analysis.

Automatic detection of animal vocalizations can be achieved using machine learning or deep learning models trained to identify and classify specific vocalizations in audio recordings. These models are designed to process large volumes of audio data automatically and to recognize specific patterns, or features, associated with animal vocalizations.

Automatic models can be deployed for real-time monitoring of marine mammal vocalizations, allowing researchers to detect and track their presence in near real-time. This capability is valuable for studying the behavior and habitat use of marine mammals and for informing conservation and management efforts.

As with any artificial intelligence model, performance is driven by the quality, reliability, and size of the dataset used to train the model. To this end, labelled datasets must be created, containing accurately annotated information about the sounds of interest.


The sounds of interest for the current model, hereafter referred to as classes, were defined according to the label categories presented in Figure 4 and Figure 5. Each class will be further detailed in the next section of this document.




How the AI model is developed


The AI model used in this work is based on a deep learning architecture: the Convolutional Neural Network (CNN).

CNNs are a subset of neural networks (NNs) that work particularly well on image data because they can extract features from pixel information while preserving the relative positions of those pixels in the image.
Sound recordings from a hydrophone, transformed into images by computing their spectrograms, can be used as input to a CNN architecture. blueOASIS applies several additional pre-processing steps to highlight patterns in the spectrogram, producing new images that are also used as inputs to the model. The CNN then returns the probability of presence of each class.
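
As an illustration, a minimal CNN of this kind could look like the following sketch, assuming the one-second Mel-spectrogram windows of size (60, 51) described below and a Keras/TensorFlow implementation. The layer sizes, the helper name `build_model`, and the value of `N_CLASSES` are illustrative assumptions, not the exact blueOASIS architecture.

```python
# Minimal sketch of a CNN that maps a (60, 51) spectrogram window to class
# probabilities. Layer choices and N_CLASSES are illustrative assumptions.
import numpy as np
from tensorflow import keras

N_CLASSES = 17  # e.g. the 16 labelled classes plus "Background" (assumption)

def build_model(n_classes: int = N_CLASSES) -> keras.Model:
    return keras.Sequential([
        keras.layers.Input(shape=(60, 51, 1)),                 # one spectrogram window
        keras.layers.Conv2D(16, (3, 3), activation="relu"),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Conv2D(32, (3, 3), activation="relu"),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Flatten(),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),   # one probability per class
    ])

model = build_model()
dummy_window = np.random.rand(1, 60, 51, 1)   # stands in for a real spectrogram window
probabilities = model.predict(dummy_window)   # shape (1, N_CLASSES), rows sum to 1
print(probabilities.round(3))
```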

The accuracy of the model depends on the quality of the training procedure and of the labelled dataset. The training process can be summarized in four steps:

  • Step 1: Processing the recording
 The preprocessing is done in Python: the wav file is loaded with the librosa library and converted into a spectrogram in Mel scale and in dB. This spectrogram (like any image) corresponds to a matrix of size (60, 51*time), where each row represents a specific frequency band and each column a very short period of time. The image is split into smaller one-second windows of size (60, 51) that can be used in the algorithm (see the preprocessing sketch after this list).

  • Step 2: Matching the 1 s windows with the labels from Audacity, i.e., the ground truth
 For each 1 s window, the algorithm checks whether there is a corresponding label and, if so, assigns the respective class to the spectrogram window. Otherwise, the class “Background” is assigned (see the label-matching sketch after this list).

  • Step 3: Data splitting
 As a result of the previous step, the dataset consists of a set of spectrogram windows, each with an associated class. However, dolphin occurrences are far less frequent than vessel or background occurrences. To train an unbiased model, the classes with the most occurrences need to be under-sampled to obtain a so-called balanced dataset. This step improves model performance despite the significant decrease in dataset size.
 To train and assess the model, the under-sampled dataset is split into a training set (85% of the data) and a final test set (15%). The final test set is an independent dataset used to assess and fairly compare the performance of different models. From the training set, around 20% is used for validation during training, which means that only around 68% of the under-sampled data is actually used for training (see the splitting sketch after this list).

  • Step 4: Training the algorithm
 An untrained model is given the spectrogram windows and their ground-truth labels as inputs. After the entire training set has passed through the network, the model should be able to differentiate between the classes. To test it, the final test set spectrograms are fed to the trained model, which returns a vector with one probability per class for each window. Finally, these vectors are compared with the ground-truth labels to determine the model's accuracy (see the training sketch after this list).
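
The sketches below illustrate how each of these steps might look in Python; they are minimal, hedged examples rather than the exact blueOASIS code. The first one covers Step 1, using librosa to compute a 60-band Mel spectrogram in dB and slicing it into one-second windows of 51 frames; the hop length is an assumption chosen only so that one second corresponds to roughly 51 columns.

```python
# Step 1 (sketch): wav -> Mel spectrogram in dB -> one-second (60, 51) windows.
# The hop length is an assumption chosen so that ~51 frames span one second.
import librosa
import numpy as np

N_MELS = 60
FRAMES_PER_SECOND = 51

def wav_to_windows(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=None)                  # keep the native sample rate
    hop = int(sr / FRAMES_PER_SECOND)                    # ~51 frames per second
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS,
                                         n_fft=2 * hop, hop_length=hop)
    mel_db = librosa.power_to_db(mel, ref=np.max)        # spectrogram in dB
    n_windows = mel_db.shape[1] // FRAMES_PER_SECOND     # whole seconds only
    windows = [mel_db[:, i * FRAMES_PER_SECOND:(i + 1) * FRAMES_PER_SECOND]
               for i in range(n_windows)]
    return np.stack(windows)                             # shape (n_windows, 60, 51)
```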
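
The second sketch covers Step 2. Audacity exports label tracks as tab-separated text (start time, end time, label); here each one-second window is assigned the class of any label that overlaps it, and “Background” otherwise. The overlap rule and the handling of the optional frequency line are assumptions for illustration.

```python
# Step 2 (sketch): assign a ground-truth class to each one-second window using
# an Audacity label file (tab-separated lines: start_time, end_time, label).
def load_labels(label_path: str) -> list[tuple[float, float, str]]:
    labels = []
    with open(label_path) as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) >= 3 and not parts[0].startswith("\\"):
                labels.append((float(parts[0]), float(parts[1]), parts[2]))
    return labels

def classes_for_windows(n_windows: int, labels) -> list[str]:
    classes = []
    for i in range(n_windows):                       # window i covers [i, i + 1) seconds
        window_class = "Background"
        for start, end, name in labels:
            if start < i + 1 and end > i:            # label overlaps this second
                window_class = name
                break
        classes.append(window_class)
    return classes
```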
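
The third sketch covers Step 3: under-sampling the majority classes down to the size of the smallest class, then splitting into an 85% training set and a 15% final test set (with 20% of the training set later reserved for validation at fit time). The use of scikit-learn and the random seed are assumptions.

```python
# Step 3 (sketch): balance the classes by under-sampling, then split 85% / 15%.
import numpy as np
from sklearn.model_selection import train_test_split

def balance_and_split(windows: np.ndarray, classes: np.ndarray, seed: int = 0):
    rng = np.random.default_rng(seed)
    smallest = min(np.sum(classes == c) for c in np.unique(classes))
    keep = np.concatenate([
        rng.choice(np.where(classes == c)[0], size=smallest, replace=False)
        for c in np.unique(classes)
    ])                                                # same count for every class
    X, y = windows[keep], classes[keep]
    # 85% training (of which ~20% is used for validation during fit), 15% test.
    return train_test_split(X, y, test_size=0.15, stratify=y, random_state=seed)
```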
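
The last sketch covers Step 4, reusing the style of CNN sketched earlier: the model is trained on the balanced training set with a 20% validation split, then evaluated on the final test set by comparing the predicted class probabilities with the ground-truth labels. The epoch count, batch size, and label encoding are illustrative assumptions.

```python
# Step 4 (sketch): train the CNN on the training set and measure test accuracy.
import numpy as np
from tensorflow import keras

def train_and_evaluate(X_train, y_train, X_test, y_test):
    class_names = sorted(set(y_train) | set(y_test))
    to_index = {name: i for i, name in enumerate(class_names)}
    y_train_idx = np.array([to_index[c] for c in y_train])
    y_test_idx = np.array([to_index[c] for c in y_test])

    model = keras.Sequential([                        # same style of CNN as sketched above
        keras.layers.Input(shape=(60, 51, 1)),
        keras.layers.Conv2D(16, (3, 3), activation="relu"),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Flatten(),
        keras.layers.Dense(len(class_names), activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train[..., np.newaxis], y_train_idx,
              epochs=10, batch_size=32, validation_split=0.2)

    probabilities = model.predict(X_test[..., np.newaxis])   # one vector per window
    predicted = probabilities.argmax(axis=1)
    return float(np.mean(predicted == y_test_idx))           # test accuracy
```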
 


The acoustic events that will be labelled

The acoustic classes mentioned above will be identified in the acoustic files, and a class code will be used to reduce spelling mistakes between different labellers. The classes and their codes are listed below (a small validation sketch follows the table):
Class                          Code
Whistles                       WHI
Burst Pulse Sounds             BPS
Gulps                          GUL
Grunts                         GRU
Creaks                         CRE
Squawks                        SAW
Squeaks                        SEA
Natural Unknown Event          NUE
Low-Frequency Vessel           LFV
Medium-Frequency Vessel        MFV
High-Frequency Vessel          HFV
Lloyd's Mirror Effect          LME
Ping                           PIN
Anthropogenic Unknown Event    AUE
Parasitic Acoustic Noise       PAN
Bad Quality Unknown            BQU
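
Since these codes are what end up in the Audacity label tracks, a small check like the following sketch can be used to flag misspelled codes before training. The helper name and the idea of validating label files this way are assumptions, not part of the documented workflow.

```python
# Sketch: flag label codes that are not in the agreed class-code list above.
VALID_CODES = {
    "WHI", "BPS", "GUL", "GRU", "CRE", "SAW", "SEA", "NUE",
    "LFV", "MFV", "HFV", "LME", "PIN", "AUE", "PAN", "BQU",
}

def check_label_file(label_path: str) -> list[str]:
    problems = []
    with open(label_path) as f:
        for line_number, line in enumerate(f, start=1):
            parts = line.strip().split("\t")
            if len(parts) >= 3 and parts[2] not in VALID_CODES:
                problems.append(f"line {line_number}: unknown code '{parts[2]}'")
    return problems
```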