In the field of bioacoustics, it is common to collect vast amounts of acoustic recordings over extended periods. Manual analysis of this data can be time-consuming and labor-intensive. Automatic models enable efficient processing of large datasets, reducing the time and effort required for analysis.
The automatic detection and classification of animal vocalizations can be achieved using a machine learning or deep learning model trained to identify and classify specific vocalizations in audio recordings. These models are designed to automatically process large volumes of audio data and recognize specific patterns, or features, associated with animal vocalizations.
These models can be deployed for real-time monitoring of marine mammals, allowing researchers to detect and track their presence in near real-time. This capability is valuable for studying the behavior and habitat use of marine mammals and for informing conservation and management efforts.
As with any data-driven model, the performance is driven by the quality, reliability, and size of the dataset used to train the model. To facilitate this, labelled datasets must be created, containing accurately labelled information relevant to the sounds of interest.
The sounds of interest for the current model, hereafter referred to as classes, were defined according to the categories presented in Figure 4 and Figure 5. Each class will be further detailed in the next section of this document.
Figure 4 - Classes of marine life sounds considered in this work.
Figure 5 - Classes of anthropogenic sounds considered in this work.
The AI model used in this work is based on a deep learning architecture well suited to image data: Convolutional Neural Networks (CNNs).
CNNs are a subset of Neural Networks (NN) that work particularly well with image data, because they can extract features from pixel information while keeping track of the pixels' relative positions in the image.
The sound recordings from a hydrophone, transformed into images by computing their spectrograms, can be used as the input of a CNN architecture. blueOASIS uses several additional pre-processing steps to highlight patterns in the spectrogram, producing new images that are also used as inputs to the model. The CNN then returns the probability of presence of each class.
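As a hedged illustration only, the sketch below shows what a small CNN of this kind could look like in Keras. The layer sizes, the number of classes (NUM_CLASSES), and the 60x51 single-channel input (matching the one-second spectrogram windows described below) are assumptions for illustration, not the exact architecture used in this work.

```python
import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of classes

# Minimal sketch: spectrogram window in, class probabilities out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(60, 51, 1)),          # one 1 s spectrogram window
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # local time-frequency features
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # probability per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # expects integer-encoded labels
              metrics=["accuracy"])
```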
The accuracy of the model depends on the chosen architecture and on the quality of the labeled dataset. The training process can be summarized in three steps:
The preprocessing is done in Python: the WAV file is loaded with the librosa library and converted into a Mel-scale spectrogram in dB. This spectrogram (like any image) corresponds to a matrix of size (60, 51*t), where each row represents a specific frequency band, each column a very short period of time, and t is the duration of the recording in seconds. The image is split into smaller one-second windows of size (60, 51) that can be used by the algorithm.
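A minimal sketch of this step, assuming the hop length is chosen so that one second of audio spans 51 spectrogram frames (the function name wav_to_windows and the exact parameter values are illustrative, not taken from the source):

```python
import librosa
import numpy as np

N_MELS = 60             # rows of the spectrogram matrix
FRAMES_PER_SECOND = 51  # assumed columns per second of audio

def wav_to_windows(path):
    """Load a WAV file and return a list of 1 s spectrogram windows of shape (60, 51)."""
    y, sr = librosa.load(path, sr=None)    # keep the native sample rate
    hop = sr // FRAMES_PER_SECOND          # ~51 frames per second
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS, hop_length=hop)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # convert power to dB

    # Split the (60, 51 * duration) matrix into one-second windows.
    n_windows = mel_db.shape[1] // FRAMES_PER_SECOND
    return [mel_db[:, i * FRAMES_PER_SECOND:(i + 1) * FRAMES_PER_SECOND]
            for i in range(n_windows)]
```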
For each 1 s window, the algorithm cross-references the annotations: if there is a corresponding label, it assigns the respective class to the spectrogram window; otherwise, the class "Background" is assigned.
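A minimal sketch of this cross-referencing, assuming annotations are available as (start, end, class) tuples in seconds (a hypothetical format; the actual label files may differ):

```python
def assign_label(window_start_s, annotations):
    """Return the class of the annotation overlapping this 1 s window,
    or "Background" if no annotation covers it."""
    window_end_s = window_start_s + 1.0
    for start_s, end_s, class_code in annotations:
        # any overlap between the window and the labelled interval counts
        if start_s < window_end_s and end_s > window_start_s:
            return class_code
    return "Background"
```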
As a result of the previous step, the dataset consists of a set of spectrograms, each with an associated class. However, the number of dolphin occurrences is much smaller than the number of vessel or background occurrences. To train an unbiased model, the classes with the most occurrences need to be under-sampled to reach a so-called balanced dataset. This is a necessary step to improve the model performance, despite the significant decrease in dataset size.
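A minimal sketch of such under-sampling, assuming the windows and labels are held in Python sequences (the helper name undersample and the seeding are illustrative):

```python
import numpy as np

def undersample(windows, labels, seed=0):
    """Randomly under-sample every class down to the size of the rarest one."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    per_class = {c: np.flatnonzero(labels == c) for c in np.unique(labels)}
    n_min = min(len(idx) for idx in per_class.values())
    keep = np.concatenate([rng.choice(idx, n_min, replace=False)
                           for idx in per_class.values()])
    keep.sort()  # preserve the original ordering of the kept windows
    return [windows[i] for i in keep], labels[keep]
```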
To train and assess the model, the under-sampled dataset is split into a training set (85% of the data) and a final test set (15%). The final test set is an independent dataset used to assess the performance of different models in a fair comparison. From the training set, around 20% is used for validation during training. This means only around 68% of the under-sampled data is used for training.
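This split could be reproduced, for example, with scikit-learn; the stratify option and random_state below are illustrative choices, not confirmed by the source, and labels are assumed to be integer-encoded:

```python
from sklearn.model_selection import train_test_split

# 15% held out as the final test set; then 20% of the remaining training
# data is reserved for validation (0.85 * 0.80 = 0.68 used for training).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.20, stratify=y_train, random_state=0)
```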
The untrained model is given the spectrograms and their ground-truth labels as inputs. After the entire training set has passed through the network, the model should be able to differentiate the classes. To test it, the final test set spectrograms are fed to the trained model, which returns a vector of class probabilities for each spectrogram. Finally, these vectors are compared with the ground-truth labels to determine the model's accuracy.
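Putting the previous sketches together, training and evaluation could look as follows (the variable names carry over from the illustrative snippets above; the number of epochs and batch size are assumptions):

```python
# Train on the training set while monitoring the validation split.
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=30, batch_size=32)

# Evaluate once on the held-out final test set.
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Final test accuracy: {test_acc:.3f}")

# model.predict returns one vector of class probabilities per window;
# the predicted class is the index of the highest probability.
probs = model.predict(X_test)
predicted_classes = probs.argmax(axis=1)
```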
The acoustic classes mentioned above will be identified in the acoustic files, and a class code will be used to reduce spelling mistakes between different labellers.
The classes will be labelled in the audio files as follows:
Anthropogenic Unknown Event