Keyword Spotting is recognizing an uttered word from an audio clip. It’s input data is a waveform, and its output is the classification. In this project, I converted 1-d audio waveform to a 2-d mel-spectrogram. The horizontal axis of a spectrogram is time (in frames) and the vertical axis is frequency. The mel-spectrogram gives more “detail” to some frequencies to closer match human hearing. I have a more detailed discussion regarding spectrograms here.
Now that the input data is 2-dimensional, the project is now very similar to an image classification problem. Convolutional neural networks, which are very successful in the vision domain can be applied to audio/speech using this spectrogram conversion. In this project however, I tried to use transformers which were popular in the natural language domain, and have since been adapted in the vision domain.
When using transformers for images, the input image is typically split into a series of smaller image patches by using a 2-dimensional grid. For example a 256×256 pixel image, split using a 16×16 grid will result into patches of 16×16 pixels.

I initially patchified my images this way and got 83% accuracy on the test set. The difference of a spectrogram with a natural image is that it already has an inherent sequential property. As stated earlier, the horizontal axis conveys time. After reading this paper by Berg et. al, I changed the patch grid size to be 1-dimensional. My patches are now of the size 128×2 “pixels”. This means I preserved the frequency axis and each patch contains two frames of the spectrogram.
I played around with different training hyperparameters and adding dropout layers to finally reach a test accuracy of 93.5%. The short video below is a sample demonstration application to showcase the model.
Other Articles
A marketing agency presents a promotional plan to a telecommunication company to increase its subscriber base and generate revenue. Using population and location data, we estimated the feasibility of this plan in this case study.
Mastering Pipelines: Integrating Feature Engineering into Your Predictive Models
Master predictive modeling with Scikit-Learn pipelines. Learn the importance of feature engineering and how to prevent data leakage.
Build an Automated ETL Pipeline. From setting up Docker to utilizing APIs and automating workflows with GitHub Actions, this post goes through it all.
Unlocking Data Science: Your Easy Docker Setup Guide
Ready to dive into data science? Learn how to set up your development environment using Docker for a seamless and reproducible workspace. Say goodbye to compatibility issues and hello to data science success!
In this article, I will discuss a way of accessing Google Cloud GPUs to train your Deep Learning projects.
However, to make it a little bit more scalable, the tables are defined in a separate Google sheet, and imported into the Google Colab notebook. Each node is a separate worksheet, and the columns list the parent nodes, the name of the node, and probability.
In this article, I fine-tuned a pre-trained object detection model using a small custom dataset.
In this article, we will discuss how to quickly showcase your model publicly as a machine learning application for free.
In this short project, I scraped all of Lebron's regular season points and plotted them in an interactive graph.
Predicting a Fitness Center’s Class Attendance with Machine Learning
In this project I analyzed a fitness center's attendance data to predict attendance rates of its group classes.
The tweet that started the trend was posted on June 12 (Philippine Independence day). On that tweet, the author stated that he dined in at a Tropical Hut branch, and he is their only customer. Despite the pandemic “restrictions”, commercial activity on malls and fast food chains are pretty much back to “normal”, so having only one customer is disheartening.
Code This case study is my capstone project for the Google Data Analytics Certification offered through Coursera. I was given access to …