Automatic Speech Recognition (ASR) with Facebook AI's wav2vec 2.0 model using Huggingface

Imaad Mohamed Khan · Advanced ·📄 Research Papers Explained ·4y ago

Key Takeaways

The video discusses the implementation of Facebook AI's wav2vec 2.0 model for Automatic Speech Recognition (ASR) using Huggingface, with a focus on the wav2vec2-base-960h model and its application in ASR without a language model.

Full Transcript

Hey everyone, hello and welcome to yet another video. In today's video we will take a look at a model that has been making the news in the NLP world: wav2vec 2.0. This is a framework for self-supervised learning of speech representations. It was released towards the end of last year by the Facebook AI team, and what's really interesting is that the paper claims to require far less training data than is usually needed to train automatic speech recognition models. We'll quickly go through the abstract before we go on to the code implementation, which I have done using Hugging Face.

So we'll first take a look at the abstract. The abstract says: "We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler." And if you see here, they mention that when lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset. That is basically a 100x improvement in terms of the amount of labeled data that is required. Using just 10 minutes of labeled data and pre-training on 53,000 hours of unlabeled data, they still achieve 4.8/8.2 WER, where WER is word error rate. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.

This is a very important advancement, because a lot of languages around the world don't have the kind of labeled data that you need to do automatic speech recognition. What you need is basically speech samples and a transcribed text of what the speech is about, and that is not readily available for many languages in the world. Building that dataset takes time, a lot of effort and a lot of money, and this paper says that we can reduce that by a wide margin,
and that is why it's really exciting.

Now let me quickly move on to the Hugging Face wav2vec2-base-960h model. I don't remember who trained it, and that is not mentioned here, but somebody pre-trained it and uploaded it to Hugging Face's model repository, and we'll be using it for our demo today. What I'm going to demo is this: I've already recorded a kind of "hello world", just one sentence of about three to four seconds. I'll first play that out for you, and then we'll see how it gets transcribed using this wav2vec2-base-960h model. Let me quickly see if I can play that out for you. Yeah, hello_world.wav: "Hello world, this is a test." I'm just going to play that out once again: "Hello world, this is a test."

Okay, so let's move on to the code where I had written this down. I worked on this starting from a demo that somebody else had built, but I've tried to write down comments and put it in a more readable format. Let me just run these installations. Here I'm checking the Python version and nvidia-smi, and then I'm installing a few libraries that I will be using to get this demo going. I really like this new arrow that Colab is showing these days. Okay, so we've already installed the requirements.

Now let me import the libraries. I'm going to import NLTK, which is a natural language processing library. librosa is an audio processing library. Then argparse, and librosa again, which you don't need multiple times. argparse I also don't need right now, because argparse is mostly for taking in command-line arguments. import torch: this is PyTorch, a very popular deep learning framework. import soundfile: as I said, this is another library you can use to manipulate sound files. From transformers, import Wav2Vec2ForCTC and Wav2Vec2Tokenizer, and these are the two functions we
are going to use from the Hugging Face transformers library. nltk.download('punkt'): you need to download this in order to not run into any issues later. This is a tokenizer from NLTK. So let's just import all of these. Okay, so we have run the cell.

Now I have a few functions here. I'll run the functions and the code first, and then we'll go through each of them. This now starts downloading the model onto the machine. Okay: "Error opening hello_world.wav". That's probably because it doesn't exist here anymore, so what I can do is go to Downloads, take hello_world.wav, and put it here. So you need to upload your sample file and run this again. Let me rename this to hello_world. Okay, we have a transcribed version over here. It says "hello world this says a test", and what I had said was "hello world, this is a test", and this was what we were able to transcribe.

Now let me quickly take you through the functions. This is where we are giving the file as input, so hello_world.wav, which is what I had uploaded. Then I'm calling this load_wav2vec_960h_model function, and this function returns a tokenizer and the model. Let me quickly take you to that function. This is the function: it returns the tokenizer and the model from pre-trained tokenizers and models. What it does is go and fetch them. Hugging Face maintains a repository of different models, from BERT to RoBERTa and all these kinds of NLP models. For every new model that comes out, Hugging Face has it almost within a week or so; it's really fast that way. So we have a pre-trained wav2vec2-base-960h model. If you see it here in more detail: "The base model pretrained and fine-tuned on 960 hours of Librispeech on 16 kHz sampled speech audio. When using the model make sure that your speech input is also sampled at 16 kHz." So this is the model we are using, and yeah, this has been pre-trained and
fine-tuned on 960 hours of Librispeech at 16 kHz. So this is the model that we are fetching here, and this is the tokenizer that we are using.

After I call this, I call this function asr_transcript, that is, automatic speech recognition underscore transcript, and here we give the tokenizer, the model and the WAV input file as parameters. If I show you that function here: it returns the transcript of the input audio recording. Your output is the transcribed text; your input is your tokenizer, model and WAV file. So you're reading the file, that is the WAV file, and getting the speech and the sample rate. You make it one-dimensional, because your input is supposed to be a one-dimensional array, and then you resample it to 16 kHz. Like we saw over here, when you're using the model, make sure that your speech input is also sampled at 16 kHz, so we are resampling it to 16 kHz. Then we tokenize it ("pt" here is PyTorch), getting the input values from the tokenizer, taking the logits over them, and after that taking the argmax, which is basically finding the most probable word ID out of all the predicted IDs. Then we take these predicted IDs and decode them using the tokenizer to get the transcription. And this is how we are able to get the transcription.

But the transcription, once you take the argmax and decode using the tokenizer, is all caps. What you saw as the output was not all caps; it was "hello world this is a test". So what we've done here is try to correct that using a correct-uppercase-sentence function. We first change the transcription to lowercase, and wherever we need to uppercase, we uppercase, and this is the function that does that for us. So we tokenize the sentence, and then we capitalize the first word and return that. Over here, this is the transcription you see, and at the end of this function you just return this transcription after all of this processing has happened, and then, yeah,
basically store that as text over here, and then we just print this text, and this text gives you "hello world this is a test". Now, this is not very accurate. We could perhaps train it on more data, or, actually, we don't really use a language model here, right? Most of these problems are solved by using a language model at the end, so trying that out along with this could be a useful thing to do.

But that's it for now. We've seen the journey from giving a WAV file as input to seeing a transcribed text output using Facebook's wav2vec 2.0 model that has been pre-trained and uploaded on Hugging Face. So this is what we've seen today, and I hope this video was useful for you to see how you could also train your own automatic speech recognition system using this model. I'll be sharing the links to all of the things that we've seen today in the description box below. Thank you so much for watching. Please do subscribe to the channel and keep supporting like you've always done. Thank you.
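
The walkthrough in the transcript can be sketched roughly as follows. This is a minimal sketch under assumptions, not the notebook's actual code: the function names are hypothetical, `Wav2Vec2Processor` stands in for the tokenizer class imported in the video, `librosa.load` is used for both the mono conversion and the 16 kHz resampling, and the casing helper splits sentences with a simple regular expression instead of the NLTK tokenizer.

```python
import re

MODEL_ID = "facebook/wav2vec2-base-960h"  # model page linked in the description
TARGET_SAMPLE_RATE = 16_000               # the model card asks for 16 kHz input


def load_model(model_id: str = MODEL_ID):
    """Fetch the pre-trained processor and CTC model from the Hugging Face hub."""
    # Imported lazily: transformers is a heavy dependency, and the first
    # call downloads the model weights.
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    return processor, model


def restore_sentence_case(transcription: str) -> str:
    """Lowercase an all-caps transcript, then capitalize each sentence start."""
    lowered = transcription.lower().strip()
    if not lowered:
        return ""
    # Split after sentence-ending punctuation. The raw model output usually
    # has none, so the whole transcript is treated as a single sentence.
    sentences = re.split(r"(?<=[.!?])\s+", lowered)
    return " ".join(s[0].upper() + s[1:] for s in sentences if s)


def asr_transcript(processor, model, wav_path: str) -> str:
    """Return the transcript of the recording stored at wav_path."""
    import librosa
    import torch

    # Read the file as a one-dimensional (mono) array resampled to 16 kHz.
    speech, _ = librosa.load(wav_path, sr=TARGET_SAMPLE_RATE, mono=True)

    # Turn the waveform into model inputs as PyTorch tensors ("pt").
    inputs = processor(speech, sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt")

    # Score every audio frame, then keep the most probable token per frame.
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    # Decode the token IDs to (all-caps) text and fix the casing.
    transcription = processor.batch_decode(predicted_ids)[0]
    return restore_sentence_case(transcription)
```

With transformers, torch and librosa installed, `processor, model = load_model()` followed by `asr_transcript(processor, model, "hello_world.wav")` would return the cased transcript for an uploaded file of that name.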

Original Description

Facebook AI's wav2vec 2.0 is a new framework that claims to perform Automatic Speech Recognition without using a language model. In this video we will quickly take a look at the abstract of the paper and then move on to the implementation of this system using Huggingface. Huggingface provides us with the wav2vec2-base-960h model that can be used to perform ASR. As described in the video, here are the relevant links:
  1. Link to the paper - https://arxiv.org/abs/2006.11477
  2. Link to Huggingface's wav2vec 2.0 model page - https://huggingface.co/facebook/wav2vec2-base-960h
  3. Link to the Colab notebook - https://colab.research.google.com/drive/1dnNrGy1U260L403OuhTsDjBQkdGHmvL9
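
The description says the model is used without a language model. With a CTC model such as the `Wav2Vec2ForCTC` class imported in the video, that typically means greedy decoding: take the most likely token for each audio frame, merge consecutive repeats, and drop the blank symbol. The sketch below illustrates the idea in plain Python. The toy vocabulary, the `one_hot` helper and the function name are hypothetical and do not reflect the model's real tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank symbol.
VOCAB = ["<blank>", " ", "a", "e", "h", "l", "o", "s", "t"]
BLANK_ID = 0


def greedy_ctc_decode(frame_scores):
    """Decode a list of per-frame score lists without a language model."""
    # 1. Argmax: pick the highest-scoring token ID for every frame.
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # 2. Collapse runs of the same token ID into a single occurrence.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Drop blanks and map the remaining IDs to characters.
    return "".join(VOCAB[i] for i in collapsed if i != BLANK_ID)


def one_hot(index, size=len(VOCAB)):
    """Build a fake score vector that puts all its weight on one token."""
    return [1.0 if j == index else 0.0 for j in range(size)]


# Frames spelling "h h _ e l l _ l o"; the blank keeps the two l's apart.
frames = [one_hot(i) for i in [4, 4, 0, 3, 5, 5, 0, 5, 6]]
print(greedy_ctc_decode(frames))  # hello
```

Because each frame is decided on its own, nothing in this step corrects an unlikely word, which is the kind of error the video suggests a language model could help with.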

The video teaches how to implement Facebook AI's wav2vec 2.0 model for ASR using Huggingface, covering the abstract of the paper and the implementation using the wav2vec2-base-960h model. This is useful for those interested in speech recognition and natural language processing.

Key Takeaways
  1. Read the abstract of the wav2vec 2.0 paper
  2. Explore the Huggingface model page for wav2vec2-base-960h
  3. Use the Colab notebook to implement the ASR system
  4. Experiment with the wav2vec 2.0 model for ASR tasks
  5. Evaluate the performance of the ASR system
💡 The wav2vec 2.0 model can perform ASR without using a language model, making it a significant advancement in speech recognition technology.
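
For the evaluation step, the metric quoted from the paper's abstract is word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of reference words. The paper's figures are percentages; the sketch below returns a fraction. It is a minimal illustration with a hypothetical function name, using a standard edit-distance computation rather than any particular evaluation library, and the example sentences are the ones spoken and transcribed in the video.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of six reference words.
score = word_error_rate("hello world this is a test", "hello world this says a test")
print(f"{score:.3f}")  # 0.167
```

A single sentence is far too small a sample to judge a model; a meaningful WER would be computed over a set of recordings with known transcripts.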
