Speech Recognition with Neural Network

a classifier to recognize 30 common words

Posted on May 01, 2018

Built an algorithm that recognizes simple spoken English commands using a Convolutional Neural Network. The audio files are preprocessed into spectrograms to make training easier for the network. Over 60,000 samples covering 30 words were used, reaching an accuracy of 85%.

Getting the dataset

You can download the audio files from the Kaggle Competition. Since the competition is now over, I did not follow its rules and deviated from its end goal, so this is not a solution for the competition.

The Script

Importing Libraries
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
import shutil

We convert the audio files downloaded from Kaggle to spectrograms. To store them under a path structure mirroring that of the audio files, we create a copy of the directories.

Making directories to save spectrograms
audio_path = 'train/audio/'
pict_Path = 'train/train_pics/'

if not os.path.exists(pict_Path):
    os.makedirs(pict_Path)

subFolderList = []
for x in os.listdir(audio_path):
    if os.path.isdir(audio_path + '/' + x):
        subFolderList.append(x)
        if not os.path.exists(pict_Path + '/' + x):
            os.makedirs(pict_Path + '/' + x)
Sample file from each Class
sample_audio = []
total = 0
for x in subFolderList:
    all_files = [y for y in os.listdir(audio_path + x) if '.wav' in y]
    total += len(all_files)
    sample_audio.append(audio_path + x + '/' + all_files[0])
    print('count: %d : %s' % (len(all_files), x))

print(total)
Speech count
Function to convert audio to spectrogram
def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, _, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, np.log(spec.T.astype(np.float32) + eps)
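As a quick sanity check, here is what the function produces for one second of a synthetic tone at 16 kHz (the sample rate of the Speech Commands recordings); the definition is repeated so the snippet runs on its own:

```python
import numpy as np
from scipy import signal

def log_specgram(audio, sample_rate, window_size=20, step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, _, spec = signal.spectrogram(audio, fs=sample_rate, window='hann',
                                        nperseg=nperseg, noverlap=noverlap,
                                        detrend=False)
    return freqs, np.log(spec.T.astype(np.float32) + eps)

# one second of a 440 Hz tone sampled at 16 kHz
sample_rate = 16000
t = np.linspace(0, 1, sample_rate, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)

# 20 ms windows with a 10 ms step give 99 time steps and 161 frequency bins
freqs, spec = log_specgram(audio, sample_rate)
print(spec.shape)  # -> (99, 161), i.e. (time_steps, frequency_bins)
```

The transpose in the return value puts time on the first axis, which is why the plotting code below calls `spectrogram.T` to get time back on the horizontal axis.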
Visualizing samples in Spectrograms
fig = plt.figure(figsize=(10,10))
for i, filepath in enumerate(sample_audio[:9]):
    plt.subplot(3,3,i+1)
    label = filepath.split('/')[-2]
    plt.title(label)
    samplerate, test_sound  = wavfile.read(filepath)
    _, spectrogram = log_specgram(test_sound, samplerate)
    plt.imshow(spectrogram.T, aspect='auto', origin='lower')
    plt.axis('off')

five_samples = [audio_path + 'five/' + y for y in os.listdir(audio_path + 'five/')[:6]]

fig = plt.figure(figsize=(10,10))
for i, filepath in enumerate(five_samples):
    plt.subplot(3,3,i+1)
    label = filepath.split('/')[-1]
    plt.title('"five": '+label)
    samplerate, test_sound  = wavfile.read(filepath)
    _, spectrogram = log_specgram(test_sound, samplerate)
    plt.imshow(spectrogram.T, aspect='auto', origin='lower')
    plt.axis('off')
Spec01 Spec02
Visualizing samples in Waveforms
fig = plt.figure(figsize=(8,20))
for i, filepath in enumerate(sample_audio[:6]):
    plt.subplot(9,1,i+1)
    samplerate, test_sound  = wavfile.read(filepath)
    plt.title(filepath.split('/')[-2])
    plt.axis('off')
    plt.plot(test_sound)

fig = plt.figure(figsize=(8,20))
for i, filepath in enumerate(five_samples):
    plt.subplot(9,1,i+1)
    samplerate, test_sound = wavfile.read(filepath)
    plt.title(filepath.split('/')[-2])
    plt.axis('off')
    plt.plot(test_sound)
Wave01 Wave02
Saving converted audio files to directory

You can feed either the spectrograms or the waveforms to the neural network. I used spectrograms assuming they would give better results, but I encourage you to try waveforms as well.

def wav2img(wav_path, targetdir='', figsize=(4,4)):
    samplerate, test_sound  = wavfile.read(wav_path)
    _, spectrogram = log_specgram(test_sound, samplerate)
    output_file = wav_path.split('/')[-1].split('.wav')[0]
    output_file = targetdir +'/'+ output_file
    plt.imsave('%s.png' % output_file, spectrogram)
    plt.close()

def wav2img_waveform(wav_path, targetdir='', figsize=(4,4)):
    samplerate, test_sound = wavfile.read(wav_path)
    plt.plot(test_sound)
    plt.axis('off')
    output_file = wav_path.split('/')[-1].split('.wav')[0]
    output_file = targetdir + '/' + output_file
    plt.savefig('%s.png' % output_file)
    plt.close()

for i, x in enumerate(subFolderList):
    all_files = [y for y in os.listdir(audio_path + x) if '.wav' in y]
    for file in all_files:
        wav2img(audio_path + x + '/' + file, pict_Path + x)
Creating Training and Validation Folder

The following code splits the images directory into two parts (9:1) for training and validating the neural network.

Val_DIR = 'train/val_pics/'
if not os.path.exists(Val_DIR):
    os.makedirs(Val_DIR)

folders = next(os.walk(pict_Path))[1]

for i in folders:
    files = next(os.walk(pict_Path + i))[2]
    size = len(files)
    part = int(size/10)
    for j in range(part):
        if not os.path.exists(Val_DIR + i):
            os.makedirs(Val_DIR + i)
        shutil.move(pict_Path + i + '/' + files[j], Val_DIR + i + '/' + files[j])
Constructing the Neural Network

I resized all my images to 128x128. You can choose a larger or smaller size depending upon the computational power of your system.

image_height = 128
image_width = 128

The Convolutional Neural Network (CNN) is an exact replica of the one I created in one of my earlier blogs - The Simpson Classifier.

The only change is in the last line of the fully connected layer. It is '30' units (classifying 30 words) instead of '10'.

predator.add(Dense(units=30, activation="softmax"))
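For completeness, here is a minimal sketch of what such a model might look like. The convolutional layer sizes below are assumptions based on a typical small CNN, not necessarily the exact ones from the Simpson Classifier post; only the final 30-unit softmax layer is confirmed by the text above:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

predator = Sequential()
# two convolution + pooling stages over the 128x128 RGB spectrogram images
predator.add(Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)))
predator.add(MaxPooling2D(pool_size=(2, 2)))
predator.add(Conv2D(32, (3, 3), activation='relu'))
predator.add(MaxPooling2D(pool_size=(2, 2)))
# flatten and classify into the 30 word classes
predator.add(Flatten())
predator.add(Dense(units=128, activation='relu'))
predator.add(Dense(units=30, activation='softmax'))
predator.compile(optimizer='adam', loss='categorical_crossentropy',
                 metrics=['accuracy'])
```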

We also increase the number of steps per epoch while 'fitting' the classifier, as we have 60,000+ images.

steps_per_epoch = 1200,
epochs = 10,
validation_steps = 120
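These arguments go into the fit call together with two ImageDataGenerator streams reading from the training and validation folders. A sketch, assuming the directory layout created in the earlier steps (the helper name make_generators is mine, not from the original script):

```python
from keras.preprocessing.image import ImageDataGenerator

def make_generators(train_dir, val_dir, image_height=128, image_width=128,
                    batch_size=50):
    # rescale pixel values to [0, 1]; class labels come from the subfolder names
    datagen = ImageDataGenerator(rescale=1./255)
    train_gen = datagen.flow_from_directory(train_dir,
                                            target_size=(image_height, image_width),
                                            batch_size=batch_size,
                                            class_mode='categorical')
    val_gen = datagen.flow_from_directory(val_dir,
                                          target_size=(image_height, image_width),
                                          batch_size=batch_size,
                                          class_mode='categorical')
    return train_gen, val_gen

# train_gen, val_gen = make_generators('train/train_pics/', 'train/val_pics/')
# predator.fit_generator(train_gen, steps_per_epoch=1200, epochs=10,
#                        validation_data=val_gen, validation_steps=120)
```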
Testing the Classifier on a new audio file

Make sure the recorded audio file is in .wav format (channel=mono). The audio file is then converted to its spectrogram for prediction.

wav2img('custom/sheila_mono.wav', 'custom')

test_image = load_img('custom/sheila_mono.png', target_size=(image_height, image_width))
test_image = img_to_array(test_image)
test_image = np.expand_dims(test_image, axis=0)
pred = predator.predict(test_image, batch_size=1, verbose=1)
Speech folders

The predicted class is the one whose output is '1', and its corresponding word can be looked up in the 'folders' list that we created in one of the earlier steps.
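Concretely, the lookup could be done with np.argmax. Note that flow_from_directory assigns class indices in alphabetical order, so the folder list must be sorted the same way; the three-word list below is a stand-in for the 30 classes:

```python
import numpy as np

folders = sorted(['yes', 'no', 'sheila'])  # stand-in for the 30 word folders
pred = np.array([[0., 0., 1.]])            # one-hot output from predator.predict

# index of the '1' in the prediction maps into the sorted folder list
predicted_word = folders[int(np.argmax(pred[0]))]
print(predicted_word)  # -> yes
```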

This is a simple single-word classifier. More complex speech recognition systems use seq2seq/RNN models to understand the context of each word within the sentence, and of the sentence itself.