Keyword Spotting App using Transformers

Keyword spotting is the task of recognizing an uttered word from an audio clip. Its input is a waveform, and its output is the predicted keyword class. In this project, I converted the 1-D audio waveform into a 2-D mel-spectrogram. The horizontal axis of a spectrogram is time (in frames) and the vertical axis is frequency. The mel scale gives finer resolution to some frequency ranges so that the representation more closely matches human hearing. I have a more detailed discussion of spectrograms here.
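
As a rough illustration, the sketch below converts a waveform into a log-scaled mel-spectrogram with torchaudio. The file name and all transform parameters (sample rate, FFT size, hop length, number of mel bins) are placeholder assumptions, not necessarily the settings used in this project.

```python
# Minimal sketch: waveform -> log mel-spectrogram with torchaudio.
# All parameter values below are assumptions for illustration only.
import torchaudio

waveform, sample_rate = torchaudio.load("sample.wav")   # (channels, samples); placeholder file

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,        # assumed FFT window size
    hop_length=256,    # assumed hop between frames
    n_mels=128,        # assumed number of mel frequency bins
)
to_db = torchaudio.transforms.AmplitudeToDB()

spec = to_db(mel(waveform))   # shape: (channels, n_mels, n_frames)
print(spec.shape)
```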

Now that the input data is 2-dimensional, the project closely resembles an image classification problem. Convolutional neural networks, which are very successful in the vision domain, can be applied to audio and speech through this spectrogram conversion. In this project, however, I used transformers, which became popular in the natural language domain and have since been adapted to vision.

When using transformers for images, the input image is typically split into a sequence of smaller patches using a 2-dimensional grid. For example, a 256×256-pixel image split with a 16×16 grid results in 256 patches of 16×16 pixels each.

Figure: vision transformer patch embedding.
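
To make the grid arithmetic concrete, here is a hedged sketch of that 2-D patchification in PyTorch; a dummy single-channel 256×256 tensor stands in for the spectrogram, and the sizes are only the example values from above.

```python
# Sketch of ViT-style 2-D patchification: a 256x256 input split into a
# 16x16 grid of 16x16-pixel patches (example sizes, not the project's exact config).
import torch

image = torch.randn(1, 256, 256)          # (channels, height, width), dummy input
patch = 16

# unfold height and width into non-overlapping 16x16 tiles, then flatten the grid
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)   # (1, 16, 16, 16, 16)
patches = patches.reshape(1, -1, patch * patch)                   # (1, 256, 256)
print(patches.shape)   # 256 patches, each flattened to 256 values
```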

I initially patchified my spectrograms this way and got 83% accuracy on the test set. Unlike a natural image, however, a spectrogram already has an inherent sequential structure: as stated earlier, the horizontal axis conveys time. After reading this paper by Berg et al., I changed the patch grid to be 1-dimensional. My patches are now 128×2 “pixels”: each patch preserves the full frequency axis and covers two frames of the spectrogram.
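
The 1-D version only splits along the time axis. The sketch below shows this for an assumed 128-mel-bin, 256-frame spectrogram; the frame count is a placeholder, but each patch is 128×2 as described above.

```python
# Sketch of the 1-D patch grid: keep the full 128-bin frequency axis and
# cut the spectrogram into 2-frame-wide patches. The 256-frame length is assumed.
import torch

spec = torch.randn(1, 128, 256)            # (channels, n_mels, n_frames), dummy input
frames_per_patch = 2

# split only along the time axis: every patch is 128 bins tall and 2 frames wide
patches = spec.unfold(2, frames_per_patch, frames_per_patch)       # (1, 128, 128, 2)
patches = patches.permute(0, 2, 1, 3).reshape(1, -1, 128 * frames_per_patch)
print(patches.shape)   # (1, 128, 256): 128 patches, each flattened to 256 values
```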

I experimented with different training hyperparameters and added dropout layers, eventually reaching a test accuracy of 93.5%; a rough sketch of this kind of architecture follows. The short video below is a sample application demonstrating the model.
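
For context, this is a minimal sketch of a transformer encoder classifier over the patch sequence, with dropout inside the encoder layers. The layer sizes, head count, dropout rate, number of keyword classes, and the absence of positional embeddings are all placeholder assumptions, not the project's actual architecture.

```python
# Hedged sketch of a transformer encoder classifier over flattened patches.
# All hyperparameters below are assumed example values.
import torch
import torch.nn as nn

class SpectrogramTransformer(nn.Module):
    def __init__(self, patch_dim=256, d_model=128, n_classes=35, dropout=0.1):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)      # project flattened patches
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dropout=dropout, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)       # keyword logits
        # positional embeddings omitted for brevity

    def forward(self, patches):                         # (batch, n_patches, patch_dim)
        x = self.encoder(self.embed(patches))
        return self.head(x.mean(dim=1))                 # mean-pool over the sequence

model = SpectrogramTransformer()
logits = model(torch.randn(8, 128, 256))                # batch of 8 patch sequences
print(logits.shape)                                     # (8, 35)
```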
