13. Speech Recognition with Convolutional Neural Networks in Keras/TensorFlow

Weights & Biases · Beginner ·🧬 Deep Learning ·7y ago

Key Takeaways

The video demonstrates how to build a speech recognition model using Convolutional Neural Networks (CNNs) in Keras/TensorFlow, achieving over 90% accuracy on test data. It covers topics such as audio classification, spectrograms, and one-hot encoding.

Full Transcript

So audio is a huge field, and it's arguably the field that really started the interest in deep learning. We are just gonna scratch the very surface of audio in this video, and what I really want to show you is that we can take the exact same techniques that we applied to text and image classification and apply them to audio. Now, it's not totally obvious how you do that, right? Audio comes in a really different format than an image or text. Typically we represent it as a wave, or maybe two waves if you have stereo sound. So how do we actually get it into a format where we can process it, and what do we do with it? Audio files tend to be big, and it tends to be complicated to ingest and handle them, so I'm gonna do a very small classification example. The idea is we want to classify people saying different specific words, and we're gonna see how well we can do that with some really simple Keras techniques.

So here's the task: we want to classify sounds, the sounds are people speaking, and we classify them into what the person is saying. I found online WAV files of various people saying the words "bed", "happy", and "cat". There were a lot more sets of WAV files there, so you can follow a link we'll put in the comments to download more if you want to classify different words. What we're gonna do is take those WAV files, do some transformations on them, and then run various types of neural nets to see how well they classify this data.

First of all, we do the standard importing of libraries like Keras, and also a preprocessing library that I mostly copied from another audio processing Git project, which has things like transforming the WAV files into spectrograms. The next thing that happens is we set the number of buckets in our spectrogram and the length of time that we want to operate over, and then we use a function from this preprocessing library to transform these WAV files into something that looks more like a sonic spectrogram.

Now, you may not have seen a spectrogram before. You can find lots of apps that do this. In a spectrogram, the x-axis is typically time, the y-axis is the frequency of the sound, and the darkness is the amount of energy at that frequency. In music or in science you typically get spectrograms that have even intervals or logarithmic intervals between the frequencies, but when you're processing speech there's a slightly different transformation that people typically do, called MFCC, and that's the one I do here. You can just roughly think of it as buckets of frequencies and buckets of time.

So we do that transformation, and then we load the training and test sets into the familiar X_train, X_test, y_train, and y_test values. This is just like previous videos. X_train was often a set of images in previous videos; in this video it's going to be sets of audio spectrograms, essentially, and X_test is going to be validation data for that. y_train is going to be the labels, so 0 corresponds to "bed", 1 corresponds to "happy", and 2 corresponds to "cat", and y_test is the same but corresponds to the test data.

Then we're gonna reshape our data a little bit. We're gonna add a channel element, and this is because typically with audio you're gonna have a left channel and a right channel. In this case we've actually removed the channel, so there really is only one channel, but this might make the code a little more generalizable to typical audio files that you'll see out there in the wild.

Before we do anything else, I think it's nice to take a look at the data that we're dealing with, using the imshow command. That works super well when we're dealing with images, right? You can look at the image and see "oh, that's a number 4" or "oh, that's a picture of my friend's face". With audio spectrograms it's a little less clear what's going on, but it's kind of nice to look at anyway. So we could look at the hundredth value of X_train, and we can see that it seems like it starts off a little quieter and maybe gets a little bit louder. It's a little hard to interpret. We can also print out the corresponding y_train label and see what that was, and it looks to me like it must be the zeroth label, and that would be "bed". So this is some kind of distorted spectrogram of somebody saying "bed".

One more thing before we get going: we have to transform y_train and y_test into one-hot versions of those. We talk about this a lot in previous videos and you can find it there, but essentially we're going from a single number to a vector of numbers where the one corresponds to the label that we want.

And then, as usual, we're gonna start with the simplest possible model, and in this case it's a perceptron. As usual, we first call Flatten to remove all the structure of our data, so the buckets, the length, and the channel: we're gonna flatten all of that out into a single vector. Then we're gonna call a Dense layer on that, and that's going to be a fully connected layer with, in this case, three different outputs, one corresponding to each word that we're trying to classify, and the typical softmax activation function we use when we're trying to do multi-class classification. We're gonna use categorical cross-entropy as usual, and the Adam optimizer, and we're also gonna report on accuracy in this case.

All right, so let's fit that model. You can see that because the dataset is reasonably small, the model runs quite fast, and you can see that this very simple linear model gets us approximately 80% accuracy on the validation data, which is not bad.

OK, so now here's the really cool thing. Because we have our data in such a standard format, we can pull from all the different types of models that we've built in earlier videos to make this model better. The first thing we can try, and this is something that people really do, is to apply a convolutional network to this. Now, you might argue that maybe we should use a 1D convolution, more like text, and you can try that, because you can think of each frequency as a separate channel. But because the frequencies do have meaning, like two frequencies close to each other are kind of semantically close, I think a 2D convolution is also a reasonable thing to try. So let's start with that.

You can find examples of all these different classifiers in my ml-class videos directory. So let's just go into cnn.py and see what happens when we paste in a standard one-level convolutional neural network. We can copy this model code right into our notebook here, and now we just have to change the input shape to be buckets, length, and channels. We can set this to be a 3x3 convolution and set the dense layer size to 128. We can compile the model in the same way, and then we can fit the model in the exact same way. And again, because it's such a small number of samples, it learns very fast.

Let's take a look in the app. This model is actually very, very good, right? It gets over 90% accuracy, 93 or 94 percent accuracy on our test data right off the bat, which is really cool. We've taken the machinery that we learned in different domains and applied it to this totally different domain, and the same intuition that we had, that convolutions might work better, actually turns out to be the case.

And you might think, well, if one convolution works well, what about two convolutions? So we can take the same thing that we did before and have a convolution and a pooling, and then a second convolution and a pooling. We build this model here and compile it. And we could go into the project and call this one "perceptron", call this guy "one convolution", and call this guy "two convolutions".

So you can see here that our two-convolution model is actually slightly better than our one-convolution model, which is awesome. It's maybe 94% accuracy versus 93% accuracy. But another thing is pretty glaring, which is that this is the test accuracy, and on the training data both the one-convolution and the two-convolution models have 100% accuracy. So it seems like we have an issue with overfitting, and again we can apply all the intuitions that we learned on text and image data to this problem.

The clear thing to do when you see this, the first thing to try, is to add some dropout. So let's put a little bit of dropout in our model, in the same place that we did before. We can say model.add Dropout, maybe drop out a quarter of the stuff, and then drop out again. Compile the model and run fit, and you see that the two-convolution model with dropout is actually learning slower on the training data, but it continues to improve, and the same thing happens on the test data. It starts off a little bit worse, but as it runs over time it gets better and better. So this dropout actually allows the model to fit the data even a little bit better than without the dropout. All the things that we expect, all the theory and intuitions that we've learned so far, apply to audio equally as well as to images or text. I just think that's super cool. Maybe we'll let that run a little bit.

Then there's one more thing that you can try, which we did on text: we could take LSTMs or GRUs and apply them to audio. This might make sense especially if we had variable-length audio files or much longer audio files. I think CNNs probably make a little more sense for these tiny files, where they run well, but let's take a peek and see how they do.

So we can copy the code from our LSTM video. When we copy the code, you see that we actually get an error, and it's a shape error. It's because the LSTM expects a two-dimensional, not a three-dimensional, input for each sample, and so you get the scary error message. In this case, remember, we added the channel variable later, so we could do a more complicated reshape, but I think the simplest thing to do is just undo the reshaping that we did before, and then we can try the LSTM.

Now, the LSTM performance is significantly worse than the convolutions, but that might be because we had a small LSTM. It also could be the fact that our data is actually not very long, and I think LSTMs would matter more as the data gets much longer. We could spend some time really doing hyperparameter tuning and maybe get this LSTM to the same accuracy as the CNNs, but I'll just say that for these kinds of short audio files, I think CNNs are gonna train faster and run faster, and are probably the better choice. If we were classifying really long conversations, that's where LSTMs might really shine.

I guess my biggest point here, and we can go deeper in subsequent videos on all types of audio processing, the big point that I want to make is that the stuff that you're learning is really transferable across domains. Domain expertise has a huge role to play here, but this stuff with CNNs is surprisingly transferable to many different areas, and I think that's just super exciting. So we'll do some more videos on audio, and we'll do some more videos on more complicated architectures.
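The model progression described in the transcript (perceptron, convolutions with dropout, then an LSTM on the un-reshaped data) might look roughly like the sketch below. This is not the code from the video. The bucket count, clip length, filter counts, LSTM size, dropout rate, and dropout placement are all assumptions chosen for illustration, and the builder function names are hypothetical.

```python
import numpy as np
from tensorflow import keras

layers = keras.layers

# Assumed sizes: 20 frequency buckets, 11 time steps, 1 channel, 3 words.
BUCKETS, MAX_LEN, CHANNELS, NUM_CLASSES = 20, 11, 1, 3


def compile_classifier(model):
    # Same loss, optimizer, and metric for every model, as in the transcript.
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model


def build_perceptron():
    # Flatten removes the 2D structure; one Dense layer with softmax
    # gives a linear multi-class classifier.
    return compile_classifier(keras.Sequential([
        keras.Input(shape=(BUCKETS, MAX_LEN, CHANNELS)),
        layers.Flatten(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ]))


def build_cnn(dropout_rate=0.25):
    # Two convolution + pooling stages. The dropout positions are a guess.
    return compile_classifier(keras.Sequential([
        keras.Input(shape=(BUCKETS, MAX_LEN, CHANNELS)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(dropout_rate),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(dropout_rate),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ]))


def build_lstm(units=16):
    # No channel axis here: each sample is 2D. Keras treats the first
    # axis as time steps, so a transpose may be worth trying.
    return compile_classifier(keras.Sequential([
        keras.Input(shape=(BUCKETS, MAX_LEN)),
        layers.LSTM(units),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ]))


# Random stand-in data, only to check that the shapes line up.
x = np.random.default_rng(0).normal(
    size=(4, BUCKETS, MAX_LEN, CHANNELS)).astype("float32")
cnn_probs = build_cnn().predict(x, verbose=0)
lstm_probs = build_lstm().predict(x.squeeze(-1), verbose=0)
```

Training any of these would then be a call along the lines of model.fit(X_train, y_train_hot, validation_data=(X_test, y_test_hot), epochs=...), with the epoch count left to experimentation.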

Original Description

Learn to build a Keras model for speech classification. Audio is the field that ignited industry interest in deep learning. Although the data doesn't look like the images and text we're used to processing, we can use similar techniques to take short speech sound bites and identify what someone is saying.
Follow along with Lukas using the Python scripts here: https://github.com/lukas/ml-class/tree/master/videos/cnn-audio
This is part of a long, free series of tutorials teaching engineers to do deep learning. Leave questions below, and check out more of our class videos:
Class Videos: http://wandb.com/classes
Weights & Biases: http://wandb.com

This video teaches how to build a speech recognition model using CNNs in Keras/TensorFlow, covering topics such as audio classification, spectrograms, and one-hot encoding. By following this lesson, viewers can learn how to apply deep learning techniques to audio data and achieve high accuracy in speech recognition tasks.
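To make the "buckets of frequencies and buckets of time" idea concrete, here is a minimal NumPy sketch of a plain magnitude spectrogram. It is not the MFCC transform used in the video and not the video's preprocessing library; it is a simpler short-time Fourier transform, and the function name, frame length, hop size, and synthetic test tone are all assumptions made for illustration.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency buckets, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal.
    return np.abs(np.fft.rfft(frames, axis=1)).T


# Synthetic input (an assumption): one second of a 1 kHz sine at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)

# Each bucket spans sample_rate / frame_len = 31.25 Hz.
peak_bucket = int(spec.mean(axis=1).argmax())
peak_hz = peak_bucket * sample_rate / 256
```

Passing spec to an image plotting call such as matplotlib's imshow gives the kind of picture the transcript describes, with energy concentrated in a single horizontal band for a pure tone.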

Key Takeaways
  1. Import necessary libraries and load audio data
  2. Transform WAV files into spectrograms
  3. Build a simple perceptron model, then apply a 2D convolutional network
  4. Use softmax activation and categorical cross entropy for multi-class classification
  5. Add dropout to prevent overfitting
  6. Compare performance of CNNs and LSTMs on short audio files
💡 Convolutional Neural Networks (CNNs) can be effectively used for speech recognition tasks, especially for short audio files, and can achieve high accuracy with the right architecture and hyperparameters.
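The data preparation behind these steps comes down to two array manipulations: adding a channel axis so the spectrograms look like single-channel images, and turning integer labels into one-hot vectors for categorical cross-entropy. The sketch below uses plain NumPy with random stand-in data. The array sizes are assumptions, and Keras also provides keras.utils.to_categorical for the one-hot step.

```python
import numpy as np

# Assumed sizes: 20 frequency buckets, 11 time steps, 1 channel, 3 words.
buckets, max_len, channels, num_classes = 20, 11, 1, 3

# Stand-in for real features: 6 clips of random values.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6, buckets, max_len))

# Integer labels as described in the video: 0 = bed, 1 = happy, 2 = cat.
y_train = np.array([0, 1, 2, 2, 0, 1])

# Add a trailing channel axis so a 2D convolution can consume the data.
X_train_cnn = X_train.reshape(-1, buckets, max_len, channels)

# One-hot encode: row i of the identity matrix is the vector for label i.
y_train_hot = np.eye(num_classes)[y_train]

# Undo the channel axis again for models that want 2D samples (the LSTM).
X_train_seq = X_train_cnn.reshape(-1, buckets, max_len)
```

Taking the argmax of each one-hot row recovers the original integer label, which is also how a softmax output is turned back into a predicted word.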
