13. Speech Recognition with Convolutional Neural Networks in Keras/TensorFlow

Weights & Biases · Beginner · 🧬 Deep Learning · 7y ago

Skills: Supervised Learning 90%, ML Maths Basics 80%, CV Basics 70%

Key Takeaways

The video demonstrates how to build a speech recognition model using Convolutional Neural Networks (CNNs) in Keras/TensorFlow, achieving over 90% accuracy on test data. It covers topics such as audio classification, spectrograms, and one-hot encoding.

Full Transcript

So audio is a huge field, and it's arguably the field that really started the interest in deep learning. We're just gonna scratch the very surface of audio in this video, and what I really want to show you is that we can take the exact same techniques that we applied to text and image classification and apply them to audio. Now, it's not totally obvious how you do that, right? Audio comes in a really different format than an image or text. Typically we represent it as a wave, or maybe two waves if you have stereo sound. So how do we actually get it into a format where we can process it, and what do we do with it? Audio files tend to be big, and it tends to be complicated to ingest and handle them, so I'm gonna do a very small classification example. The idea is that we want to classify people saying different specific words, and we're gonna see how well we can do that with some really simple Keras techniques.

So here's the task. We want to classify sounds, the sounds are people speaking, and we classify them by what the person is saying. I found online WAV files of various people saying the words "bed", "happy", and "cat". There were actually a lot more sets of WAV files there, so you can follow a link we'll put in the comments to download more if you want to classify different words. What we're gonna do is take those WAV files, do some transformations on them, and then run various types of neural nets to see how well they classify this data.

First of all, we do the standard importing of libraries like Keras, plus a preprocess library that I mostly copied from another audio processing Git project, which does things like transform the WAV files into spectrograms. The next thing that happens is we set the number of buckets in our spectrogram and the length of time that we want to operate over, and then we use a function from this preprocessing library to transform these WAV files into something that looks more like a spectrogram.

Now, you may not have seen a spectrogram before. You can find lots of apps that do this. In a spectrogram, the x-axis is typically time, the y-axis is the frequency of the sound, and the darkness is the amount of energy at that frequency. In music or in science you typically get spectrograms that have even intervals or logarithmic intervals between the frequencies, but when you're processing speech there's a slightly different transformation that people typically do, called MFCC, and that's the one I use here. You can just roughly think of it as buckets of frequencies and buckets of time.

So we do that transformation, and then we load the training and test sets into the familiar X_train, X_test, y_train, and y_test values. This is just like previous videos. X_train was often a set of images in previous videos; in this video it's going to be a set of audio spectrograms, essentially, and X_test is going to be the validation data for that. y_train is going to be the labels, so 0 corresponds to "bed", 1 corresponds to "happy", and 2 corresponds to "cat", and y_test is the same but corresponds to the test data.

Then we're gonna reshape our data a little bit. We're gonna add a channel dimension, and this is because typically with audio you're gonna have a left channel and a right channel. In this case we've actually removed the channels, so there really is only one channel, but this might make the code a little more generalizable to typical audio files that you'll see out there in the wild.

Before we do anything else, I think it's nice to take a look at the data that we're dealing with, using the imshow command. That works super well when we're dealing with images, right? You can actually look at the image and see "oh, that's a number 4" or "oh, that's a picture of my friend's face". With audio spectrograms it's a little less clear what's going on, but it's kind of nice to look at anyway. So we could look at the hundredth value of X_train, and we can see that it seems like it starts off a little quieter and maybe gets a little bit louder. It's a little hard to interpret. We can also print out the corresponding y_train label and see what that was, and it looks to me like it must be the zeroth label, and that would be "bed". So this is some kind of distorted spectrogram of somebody saying "bed".

One more thing before I go on: we have to transform y_train and y_test into one-hot versions. We talk about this a lot in previous videos and you can find it there, but essentially we go from a single number to a vector of numbers where the 1 corresponds to the label that we want.

Then, as usual, we're gonna start with the simplest possible model, and in this case it's a perceptron. As usual, we first call Flatten to remove all the structure of our data, so the buckets, the length, and the channel are all flattened out into a single vector. Then we call a Dense layer on that, which is a fully connected layer with, in this case, three different outputs, one corresponding to each word that we're trying to classify, and the typical softmax activation function we use when we're doing multi-class classification. We're gonna use categorical cross-entropy as usual, and the Adam optimizer, and we're also gonna report on accuracy. All right, so let's fit that model. You can see that because the dataset is reasonably small, the model runs quite fast, and this very simple linear model gets us approximately 80% accuracy on the validation data, which is not bad.

Okay, so now here's the really cool thing. Because we have our data in such a standard format, we can pull from all the different types of models that we've built in earlier videos to make this model better. The first thing we can try, and this is something that people really do, is to apply a convolutional network. Now, you might argue that maybe we should use a 1D convolution, more like text, and you can try that, because you can think of each frequency as a separate channel. But because the frequencies do have meaning, in that two frequencies close to each other are semantically close, I think a 2D convolution is also a reasonable thing to try. So let's start with that.

In my ml-class videos directory you can find examples of all these different classifiers. So let's go into cnn.py and see what happens when we paste in a standard one-level convolutional neural network. We can just copy this model code right into our notebook here. Now we just have to change the input shape to be buckets, length, and channels, and we can set this to be a 3x3 convolution and the dense layer size to 128. We can compile the model in the same way, and then we can fit the model in the exact same way. Again, because it's such a small number of samples, it learns very fast. Let's take a look in the app. This model is actually very good: it gets over 90% accuracy, 93 or 94% accuracy, on our test data right off the bat, which is really cool. We've taken the machinery that we learned in different domains and applied it to this totally different domain, and the same intuition that we had, that convolutions might work better, actually turns out to be the case.

You might think, well, if one convolution works well, what about two convolutions? So we can take the same thing that we did before and do a convolution and a pooling, and then a second convolution and a pooling. Build this model here, compile it, and we could go into the project: we'll call this one "perceptron", this one "one convolution", and this one "two convolutions". You can see here that our two-convolution model is actually slightly better than our one-convolution model, which is awesome. It's maybe 94% accuracy versus 93% accuracy.

But another thing is pretty glaring, which is that this is the test accuracy, and on the training data both the one-convolution and the two-convolution models have 100% accuracy. So it seems like we have an issue with overfitting, and again we can apply all the intuitions that we learned on text and image data to this problem. The first thing to try when you see this is to add some dropout. So let's put a little bit of dropout in our model, in the same place that we did before. We can say model.add(Dropout), maybe drop out a quarter of the stuff, and then drop out a quarter of the stuff again. Compile the model and run fit, and you see that the two-convolutions-with-dropout model is actually learning slower on the training data, but it continues to improve. The same thing happens on the test data: it starts off a little bit worse, but as it runs over time it gets better and better. So this dropout actually allows the model to fit the test data even a little bit better than without dropout. So all the things that we expect, all the theory and intuitions that we've learned so far, apply to audio equally as well as to images or text. I just think it's super cool. Maybe let that run a little bit.

Then there's one more thing that you can try, which we did on text: we could take LSTMs or GRUs and apply them to audio. This might make sense especially if we had variable-length audio files or much longer audio files. I think CNNs probably make a little more sense for these tiny files, where they run well, but let's take a peek and see how they do. So we can copy the code from our LSTM video, and when we copy the code you see that we get an error. It's a shape error, and it's because the LSTM expects a two-dimensional input for each example, not a three-dimensional one, so you get the scary error message. Remember, we added the channel dimension earlier. We could do a more complicated reshape, but I think the simplest thing to do is just undo the reshaping that we did before, and then we can try the LSTM.

Now, the LSTM performance is significantly worse than the convolutions, but that might be because we had a small LSTM. It could also be the fact that our data is not very long, and I think LSTMs would matter more as the data gets much longer. We could spend some time really doing hyperparameter tuning and maybe get this LSTM to the same accuracy as the CNNs, but I'll just say that for these kinds of short audio files I think CNNs are gonna be faster, both to train and to run, and probably the better choice. If we were classifying really long conversations, that's where LSTMs might really shine.

I guess my biggest point here, and we can go deeper in subsequent videos on all types of audio processing, is that the stuff that you're learning is really transferable across domains. Domain expertise has a huge role to play here, but this stuff with CNNs is surprisingly transferable to many different areas, and I think that's just super exciting. So we'll do some more videos on audio, and we'll do some more videos on more complicated architectures. Can't wait.
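The perceptron and the two-convolution model with dropout described in the transcript can be sketched roughly as below. This is a minimal sketch, not the code from the video: the input shape (20 buckets, 11 time steps, 1 channel), the 32 filters, the "same" padding, the 2x2 pooling, and the helper names build_perceptron, build_two_conv, and compile_model are all assumptions made for illustration. Only the 3x3 kernel, the dense layer of 128, the dropout of roughly a quarter, the three softmax outputs, categorical cross-entropy, and Adam come from the transcript.

```python
from tensorflow import keras

# Assumed sizes for illustration; the video sets its own values in the notebook.
BUCKETS, MAX_LEN, CHANNELS, NUM_CLASSES = 20, 11, 1, 3
INPUT_SHAPE = (BUCKETS, MAX_LEN, CHANNELS)


def build_perceptron():
    """Flatten the spectrogram and classify with one dense softmax layer."""
    return keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        keras.layers.Flatten(),
        keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])


def build_two_conv(dropout_rate=0.25):
    """Two convolution + pooling stages, each followed by dropout."""
    return keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Dropout(dropout_rate),
        keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Dropout(dropout_rate),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])


def compile_model(model):
    """Same loss, optimizer, and metric for every model, as in the transcript."""
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model


perceptron = compile_model(build_perceptron())
two_conv = compile_model(build_two_conv())
two_conv.summary()
```

Training either model would then be a call to fit with the spectrogram arrays and the one-hot labels, passing the test arrays as validation data so that training and validation accuracy can be compared for signs of overfitting.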

Original Description

Learn to build a Keras model for speech classification. Audio is the field that ignited industry interest in deep learning. Although the data doesn't look like the images and text we're used to processing, we can use similar techniques to take short speech sound bites and identify what someone is saying. Follow along with Lukas using the Python scripts here: https://github.com/lukas/ml-class/tree/master/videos/cnn-audio This is part of a long, free series of tutorials teaching engineers to do deep learning. Leave questions below, and check out more of our class videos: Class Videos: http://wandb.com/classes Weights & Biases: http://wandb.com



This video teaches how to build a speech recognition model using CNNs in Keras/TensorFlow, covering topics such as audio classification, spectrograms, and one-hot encoding. By following this lesson, viewers can learn how to apply deep learning techniques to audio data and achieve high accuracy in speech recognition tasks.
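To make the idea of a spectrogram concrete, here is a small NumPy sketch that turns a waveform into buckets of frequency by buckets of time. It computes a plain magnitude spectrogram, not the MFCC transform used in the video, and the function name spectrogram, the frame length, the hop size, and the synthetic 1000 Hz tone are assumptions chosen for illustration.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Return an array of shape (frequency buckets, time frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the wave into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # The FFT of each frame gives the energy in each frequency bucket.
    return np.abs(np.fft.rfft(frames, axis=1)).T


sample_rate = 8000                        # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)       # synthetic 1000 Hz tone

spec = spectrogram(tone)
peak_bucket = int(spec.mean(axis=1).argmax())
peak_hz = peak_bucket * sample_rate / 256
print(spec.shape, peak_bucket, peak_hz)
```

Each column of the result is one slice of time and each row is one frequency bucket, which is why the array can be treated like a single-channel image by a 2D convolution.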

Key Takeaways

Import necessary libraries and load audio data
Transform WAV files into spectrograms
Build a simple perceptron baseline, then apply a 2D convolutional network
Use softmax activation and categorical cross entropy for multi-class classification
Add dropout to prevent overfitting
Compare performance of CNNs and LSTMs on short audio files
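The data preparation steps behind these takeaways (adding a channel axis for the CNN, one-hot encoding the labels, and undoing the reshape before trying an LSTM) can be sketched with NumPy alone. The array sizes and the random placeholder data are assumptions for illustration and are not the dataset from the video.

```python
import numpy as np

# Placeholder data: 6 clips, 20 frequency buckets, 11 time steps (assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20, 11))
y = np.array([0, 1, 2, 2, 0, 1])  # 0 = bed, 1 = happy, 2 = cat

# Add a trailing channel axis so a 2D convolution sees (buckets, length, channels).
X_cnn = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1)

# One-hot encode: each label becomes a vector with a 1 at the label's index.
num_classes = 3
y_hot = np.eye(num_classes)[y]

# Recurrent layers such as LSTM take a 3D batch, so drop the channel axis again.
X_rnn = X_cnn.reshape(X_cnn.shape[0], X_cnn.shape[1], X_cnn.shape[2])

print(X_cnn.shape, y_hot.shape, X_rnn.shape)
```

Because the channel axis has size 1, adding and removing it only changes the shape of the array; the values are untouched.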

💡 Convolutional Neural Networks (CNNs) can be effectively used for speech recognition tasks, especially for short audio files, and can achieve high accuracy with the right architecture and hyperparameters.
