Bird Sound Classifier
The full code can be accessed at the GitHub repository.
fast.ai’s courses and software make it extremely easy to start working on difficult projects very quickly. This is just another example of that.
This is a Bird Sound Classifying deep learning model: it takes in bird sounds, converts them into images (spectrograms), and then classifies those images by the type of bird call they represent.
The data is from: https://datadryad.org/resource/doi:10.5061/dryad.4g8b7/1
There are 6 types of bird calls: distance, hat, kackle, song, stack, tet.
This model reaches around 80% accuracy, which is not bad at all for a model trained on such a small dataset.
Libraries
The necessary libraries and functions have to be imported:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai import *
from fastai.vision import *
import os  # used below for os.makedirs (also re-exported by fastai's star imports)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram
import librosa
import librosa.display
Since this was done in Google's Colab environment, Google Drive has to be mounted so that the data can be imported:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'bird-recognition'
Mounted at /content/gdrive
path = Path(base_dir+'/wav_files_playback')
Creating spectrograms:
Next, a very simple function, create_fold_spectrograms, takes the folder name as input and creates spectrograms in corresponding folders under a separate path. This uses the librosa package. The code is similar to the one used in: https://github.com/etown/dl1/blob/master/UrbanSoundClassification.ipynb
def create_fold_spectrograms(folder):
    spectrogram_path = Path(base_dir + '/specto')
    audio_path = path
    os.makedirs(spectrogram_path/folder, exist_ok=True)
    for audio_file in list(Path(audio_path/f'{folder}').glob('*.wav')):
        samples, sample_rate = librosa.load(audio_file)
        # Hide axes and frame so that only the spectrogram pixels get saved
        fig = plt.figure(figsize=[0.72, 0.72])
        ax = fig.add_subplot(111)
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        ax.set_frame_on(False)
        filename = spectrogram_path/folder/Path(audio_file).name.replace('.wav', '.png')
        # Mel spectrogram on a decibel scale, saved as a PNG
        S = librosa.feature.melspectrogram(y=samples, sr=sample_rate)
        librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
        plt.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
        plt.close('all')

folds = ['distance', 'hat', 'kackle', 'song', 'stack', 'tet']
for i in folds:
    create_fold_spectrograms(str(i))
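As a quick sanity check (not part of the original notebook), the number of generated images in each class folder can be counted before building the dataset:

# Count the spectrogram PNGs generated for each class
# (uses base_dir and the folds list defined above)
spectrogram_path = Path(base_dir + '/specto')
for folder in folds:
    n_images = len(list((spectrogram_path/folder).glob('*.png')))
    print(f'{folder}: {n_images} spectrograms')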
Data Bunch
Once the sound files are converted into image files, the data can be extracted from the folders and separated into training and validation sets.
np.random.seed(42)
spectrogram_path = Path(base_dir + '/specto')
# Only disable flipping: mirroring a spectrogram changes its meaning,
# since its axes represent time and frequency rather than space
tfms = get_transforms(do_flip=False)
data = ImageDataBunch.from_folder(spectrogram_path, train=".", ds_tfms=tfms, valid_pct=0.2, size=224)
data.normalize(imagenet_stats)
ImageDataBunch;
Train: LabelList (152 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
distance,distance,distance,distance,distance
Path: /content/gdrive/My Drive/bird-recognition/specto;
Valid: LabelList (37 items)
x: ImageList
Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224),Image (3, 224, 224)
y: CategoryList
tet,tet,distance,distance,hat
Path: /content/gdrive/My Drive/bird-recognition/specto;
Test: None
data.show_batch(rows=3,figsize=(7,7))
data.classes, data.c, len(data.train_ds), len(data.valid_ds)
(['distance', 'hat', 'kackle', 'song', 'stack', 'tet'], 6, 152, 37)
We can see that there are 6 classes and a roughly 80/20 split between the training set (152 images) and the validation set (37 images).
Training
Next, cnn_learner can be used, with a resnet34 architecture, to train the model:
learn = cnn_learner(data, models.resnet34, metrics=[error_rate,accuracy])
Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /root/.cache/torch/checkpoints/resnet34-333f7ec4.pth
100%|██████████| 87306240/87306240 [00:00<00:00, 101948611.46it/s]
learn.fit_one_cycle(6,max_lr=slice(3e-03))
epoch | train_loss | valid_loss | error_rate | accuracy | time |
---|---|---|---|---|---|
0 | 0.438323 | 0.588238 | 0.189189 | 0.810811 | 00:03 |
1 | 0.361946 | 0.716108 | 0.324324 | 0.675676 | 00:03 |
2 | 0.333349 | 1.141138 | 0.297297 | 0.702703 | 00:03 |
3 | 0.290415 | 1.483750 | 0.297297 | 0.702703 | 00:03 |
4 | 0.271594 | 1.513314 | 0.324324 | 0.675676 | 00:03 |
5 | 0.240999 | 1.303614 | 0.297297 | 0.702703 | 00:03 |
The accuracy is only about 70%, which can be made higher by unfreezing the model and choosing a good learning rate:
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.unfreeze()
learn.fit_one_cycle(6,max_lr=slice(3e-03))
epoch | train_loss | valid_loss | error_rate | accuracy | time |
---|---|---|---|---|---|
0 | 0.089979 | 1.076153 | 0.324324 | 0.675676 | 00:03 |
1 | 0.085291 | 0.820889 | 0.243243 | 0.756757 | 00:03 |
2 | 0.080007 | 0.758169 | 0.189189 | 0.810811 | 00:03 |
3 | 0.099773 | 0.824883 | 0.216216 | 0.783784 | 00:03 |
4 | 0.106347 | 0.963399 | 0.243243 | 0.756757 | 00:03 |
5 | 0.101405 | 0.916323 | 0.216216 | 0.783784 | 00:03 |
The final accuracy is about 78%. With some tinkering, this can be pushed slightly above 80% as well; one possible direction is sketched below.
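A hedged sketch of that tinkering (not from the original notebook): re-run the learning-rate finder on the unfrozen model and fine-tune for a few more epochs with discriminative learning rates read off the plot.

learn.lr_find()
learn.recorder.plot()
# The range below is illustrative; it should be picked from the LR finder plot
learn.fit_one_cycle(4, max_lr=slice(1e-5, 1e-3))
learn.save('birds-stage-2')  # checkpoint the fine-tuned weights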
Interpreting Results:
Using the ClassificationInterpretation class, the results of the trained model can be interpreted:
interp=ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
interp.most_confused(min_val=2)
[('tet', 'hat', 5), ('tet', 'stack', 2)]
From this, it is evident that tet causes the most problems: it is misclassified 7 times and correctly classified only once. Otherwise, the model is almost fully accurate.
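To see exactly which spectrograms the model struggles with, the same interpretation object can also display the highest-loss validation examples (an optional extra step, using a standard fastai call):

# Show the 9 validation images with the highest loss, labelled with
# predicted class, actual class, loss, and probability of the actual class
interp.plot_top_losses(9, figsize=(10, 10))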