Bird Sound Classifier

The full code can be accessed at the GitHub repository.

fast.ai’s courses and software make it extremely easy to start working on difficult projects very quickly. This is just another example of that.

This is a deep learning model for classifying bird sounds: it takes in audio recordings of bird calls, converts them into images (spectrograms), and then classifies those images by the type of call.

The data is from: https://datadryad.org/resource/doi:10.5061/dryad.4g8b7/1

There are 6 types of bird calls: distance, hat, kackle, song, stack, and tet.

The model reaches around 80% accuracy, which is a reasonable result given how small the dataset is and how similar the different call types can sound.

Libraries

The necessary libraries and functions have to be imported:

%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai import *
from fastai.vision import *

import os
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram
import librosa
import librosa.display

Since this was done in Google's Colab environment, Google Drive has to be mounted so that the data can be imported.

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
Mounted at /content/gdrive
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'bird-recognition'
path = Path(base_dir + '/wav_files_playback')

Creating Spectrograms

Next, a simple function, create_fold_spectrograms, is defined. It takes a folder name as input and creates spectrograms for each audio file in a corresponding folder under a separate path, using the librosa package. The code is similar to the one used in: https://github.com/etown/dl1/blob/master/UrbanSoundClassification.ipynb

def create_fold_spectrograms(folder):
    spectrogram_path = Path(base_dir + '/specto')
    audio_path = path
    os.makedirs(spectrogram_path/folder, exist_ok=True)
    for audio_file in list(Path(audio_path/f'{folder}').glob('*.wav')):
        # Load the audio, then set up a small figure with no axes or frame
        # so the saved image contains only the spectrogram itself
        samples, sample_rate = librosa.load(audio_file)
        fig = plt.figure(figsize=[0.72, 0.72])
        ax = fig.add_subplot(111)
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        ax.set_frame_on(False)
        filename = spectrogram_path/folder/Path(audio_file).name.replace('.wav', '.png')
        # Compute a mel spectrogram, convert it to decibels, and save it as a PNG
        S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
        librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
        plt.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
        plt.close('all')

folds = ['distance', 'hat', 'kackle', 'song', 'stack', 'tet']
for fold in folds:
    create_fold_spectrograms(fold)
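
As a quick sanity check (not part of the original notebook), one of the generated spectrograms can be opened with fastai's open_image to confirm the images look reasonable; the 'distance' folder is just an arbitrary example here:

# Minimal sanity check (assumption: the 'distance' folder now contains PNGs)
spectrogram_path = Path(base_dir + '/specto')
sample_png = list((spectrogram_path/'distance').glob('*.png'))[0]
img = open_image(sample_png)   # fastai.vision helper, returns an Image
print(img.shape)               # expect something like torch.Size([3, H, W])
img.show(figsize=(3, 3))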

Data Bunch

Once the sound files are converted into image files, the data can be loaded from the folders and separated into training and validation sets.

np.random.seed(42)
spectrogram_path = Path(base_dir + '/specto')
# disable flipping only: a mirrored spectrogram would represent a different
# sound, so the usual horizontal-flip augmentation makes no sense here
tfms = get_transforms(do_flip=False)
data = ImageDataBunch.from_folder(spectrogram_path, train=".", ds_tfms=tfms, valid_pct=0.2, size=224)
data.normalize(imagenet_stats)
ImageDataBunch;

Train: LabelList (152 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
distance,distance,distance,distance,distance
Path: /content/gdrive/My Drive/bird-recognition/specto;

Valid: LabelList (37 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
tet,tet,distance,distance,hat
Path: /content/gdrive/My Drive/bird-recognition/specto;

Test: None
data.show_batch(rows=3,figsize=(7,7))
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
(['distance', 'hat', 'kackle', 'song', 'stack', 'tet'], 6, 152, 37)

We can see that there are 6 classes, with 152 training images and 37 validation images (roughly an 80/20 split).
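
Because the dataset is so small, it is also worth checking how many examples each class actually gets. A minimal sketch (the attribute access below relies on fastai v1 internals, where data.train_ds.y.items holds the class indices):

from collections import Counter

# Count how many training images fall into each class
train_counts = Counter(data.train_ds.y.items)
for idx, count in sorted(train_counts.items()):
    print(data.classes[idx], count)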

Training

Next, cnn_learner can be used with a resnet34 architecture to train the model:

learn = cnn_learner(data, models.resnet34, metrics=[error_rate,accuracy])
Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /root/.cache/torch/checkpoints/resnet34-333f7ec4.pth
100%|██████████| 87306240/87306240 [00:00<00:00, 101948611.46it/s]
learn.fit_one_cycle(6,max_lr=slice(3e-03))
epoch train_loss valid_loss error_rate accuracy time
0 0.438323 0.588238 0.189189 0.810811 00:03
1 0.361946 0.716108 0.324324 0.675676 00:03
2 0.333349 1.141138 0.297297 0.702703 00:03
3 0.290415 1.483750 0.297297 0.702703 00:03
4 0.271594 1.513314 0.324324 0.675676 00:03
5 0.240999 1.303614 0.297297 0.702703 00:03

The first run only reaches about 70% accuracy. Running the learning rate finder, unfreezing the network, and training again pushes it higher:

learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.unfreeze()
learn.fit_one_cycle(6,max_lr=slice(3e-03))
epoch train_loss valid_loss error_rate accuracy time
0 0.089979 1.076153 0.324324 0.675676 00:03
1 0.085291 0.820889 0.243243 0.756757 00:03
2 0.080007 0.758169 0.189189 0.810811 00:03
3 0.099773 0.824883 0.216216 0.783784 00:03
4 0.106347 0.963399 0.243243 0.756757 00:03
5 0.101405 0.916323 0.216216 0.783784 00:03

The final accuracy is about 78%. With some tinkering, it can be pushed slightly above 80%.
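
One form of tinkering is to keep fine-tuning the unfrozen model with discriminative learning rates and checkpoint the weights. A hedged sketch, where the learning-rate range and the checkpoint name are assumptions rather than values from the original run:

# Sketch only: the learning-rate range below is an assumed value, not taken
# from the original notebook; in practice it should be read off learn.recorder.plot()
learn.fit_one_cycle(4, max_lr=slice(1e-5, 1e-3))
learn.save('bird-sounds-stage-2')   # checkpoint the weights for later comparison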

Interpreting Results

Using the ClassificationInterpretation class, the results of the trained model can be interpreted:

interp=ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
interp.most_confused(min_val=2)
[('tet', 'hat', 5), ('tet', 'stack', 2)]

From this, it is evident that tet causes the most problems: it is misclassified 7 times (5 times as hat and twice as stack) and correctly classified only once. Otherwise, the model is almost fully accurate.
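
To dig into why tet keeps getting confused, the highest-loss predictions can be inspected visually; a short sketch using the same interpretation object:

# Show the spectrograms the model got most wrong, titled with
# predicted class / actual class / loss / probability
interp.plot_top_losses(9, figsize=(10, 10))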

