Neural Voice Cloning

CodeEmporium · Beginner · 📐 ML Fundamentals · 8y ago

Skills: Neural Network Basics 80% · ML Maths Basics 60%

Key Takeaways

The video explores Neural Voice Cloning, a technology developed by Baidu, which allows cloning an unseen speaker's voice with only a few sound clips, as presented in a research paper on arXiv.

Full Transcript

Wouldn't it be cool to have an AI that listens to you speak for a few seconds and is then able to say different things in your voice? Well, it exists, and before getting into details I'll show you exactly what it can do. Consider the original audio clip. This is how the speaker sounds: "The regional newspapers have outperformed the national titles." Listening to a few seconds of audio like this from a speaker, the AI is able to take some text input and generate new audio saying that text, like this: "The large items are put into containers for disposal." So you can make an AI say anything you want, and in your voice. That's pretty slick, right?

I'm Ajay Halthor, and in this video we're gonna take a look at how exactly such a neural voice cloning system works. Stay notified about my videos by clicking that subscribe button and hitting that bell icon. Now let's get to it.

Last month, researchers at Baidu's Silicon Valley AI Lab developed a neural voice cloning system. This system requires only a few samples of a single speaker's audio to generate speech in that speaker's voice. I stress the phrase "a few samples" because until now neural networks have required copious amounts of data to train themselves to perform any type of task. They need such large amounts of training data because of the thousands and even millions of parameters that they have to estimate.

The paper, "Neural Voice Cloning with a Few Samples", proposes two different methods to perform voice cloning. The first is speaker adaptation. This involves just tweaking or fine-tuning a pre-trained model, and hence making it adapt to, or cater to, the current speaker. The second method is speaker encoding. There are no pre-trained models here; we train two models, the generator model and the speaker encoder model, simultaneously.

Before getting into the details of these processes, I'm gonna have to explain a few concepts so that we're all on the same page. Let's start with something easy: voice cloning. Voice cloning involves
reproducing the voice of an unseen speaker. "It's a beautiful day, isn't it?" "I'm going to conquer the world whether you like it or not." Okay. [Music]

We perform voice cloning with only a few samples, and this is considered few-shot generative modeling of speech. In other words, we only require a few samples of a speaker's speech in order to clone it. Few-shot generative modeling is challenging because it requires learning speaker characteristics from a limited amount of data.

Let's now take a look at another term: generative models. These are distributions that can be sampled from, and such samples correspond to real data. For example, consider a generative model that models animal images. Sampling from it, I should be able to get the image of, say, a dog. The next time I sample from it, I may get the image of a cat. For the current problem of voice cloning, the generative model models speech, so sampling from this model gives some speech audio by some speaker, and every time I sample from this model I can get audio from a different speaker.

Now that you know that generative models generate data, I think you can guess what needs to be done to create a voice cloning AI. We need to train the generative model so that we can sample speech audio from it. This model is trained on multiple speakers with multiple accents. If you want to get into details, this paper uses the LibriSpeech dataset, which consists of about 2,500 speakers and 820 hours of data.

To train our model, you would assume we use text-audio pairs. However, speakers with different accents say the same sentence in different ways, so multiple spectrograms will be mapped to the same text. This leads to a less accurate generative model. However, when the model is additionally given information about the speaker, such as dialect, accent or gender, it is able to model the differences between the speech samples and hence improve performance. This information about the speaker is called the speaker embedding. And so, to train a generative model, instead of just text
audio pairs, we require triplets of text, audio and speaker embeddings for every sample. Formally defined, speaker embeddings are low-dimensional continuous representations of speaker characteristics.

Now that we've got the basic terms out of the way, let's talk about designing an objective function. Like I mentioned before, we train our generative model to generate audio. We call our generative model f. In such parametric models, training refers to learning the parameters of the model. Let's call these parameters theta. Remember, we also want to learn an embedding to distinguish between user characteristics like pitch, accent and dialect. Learning is equivalent to estimating a set of parameters. Let's call the speaker embedding parameters for speaker s_i e_{s_i}.

What exactly do we give the trained model to produce an output? We provide two things. The first is the text to say: here t_{i,j} is the j-th word spoken by speaker i. The second is the identity of the speaker, that's s_i. This helps us model the parameters for the speaker embedding. Sampling from this f, we get some cloned audio of speaker s_i saying the word t_{i,j}.

For training, we have a dataset for every speaker s_i consisting of some text t and the corresponding actual audio a_{i,j} of the speaker saying the word. The idea is thus to minimize the divergence between the cloned audio sampled from f and the actual audio in the dataset for the same speaker. This is the loss for just one sample from a single speaker, and so we take the expected value of this loss over all speakers in S and over all samples in the dataset T_{s_i}. This is done to learn the model parameters theta and the speaker embeddings e. And this is the general objective function, with a written explanation of each term.

So here's a question: why do we take the expected value of the loss instead of computing the loss directly? This is because we don't know how tractable, or easy to compute, the loss function is. It can be, and usually is, a complex function, and hence it becomes more feasible to
determine the approximate value of the loss, and in math this approximation is given by the expected value.

Let us now take a look at the two methods for actually computing this loss and hence performing voice cloning. We start with speaker adaptation. Here is the idea: we have a pre-trained audio generator, and we just need to fine-tune it to produce the voice of some unseen speaker, given some text. Even in speaker adaptation there are two approaches to fine-tuning. The first is embedding-only adaptation and the second is whole-model adaptation.

In the embedding-only adaptation approach, the only thing we need to do is further train the embedding to cater to a new speaker. We don't need to touch the speech generative model. So the new loss function can be obtained from the general one we derived. Since the generative model is pre-trained, there is no theta estimation. We only need some text and the corresponding audio samples spoken by the current speaker s_k. Note that s_k is an unseen speaker that the speech generative model f hasn't seen before. I put a cap on theta to indicate it's fixed here. Since the embedding doesn't have nearly as many parameters as the speech model, we don't require the new speaker to talk too much, as we don't need that much data to model his or her voice.

Let's take a look at some results of this approach. First, here's the original sample voice: "We also need a small plastic snake and a big toy frog for the kids." Now, using embedding-only adaptation, here's the synthesized voice: "Learn about setting up wireless network configuration." You can tell the voice is similar to the original speaker. Let's try something similar, but with a male voice this time. So here's the original speech: "Some have accepted it as a miracle without physical explanation." And here is the synthesized voice using embedding-only adaptation: "Feedback must be timely and accurate throughout the project." Not bad, right? The voices are nearly the same.

So now let's take a look at the other speaker adaptation approach that
I mentioned that's used for voice cloning, and that's whole-model adaptation. We have a pre-trained model, but not only do we fine-tune the speaker embedding, as in the embedding-only approach, we also fine-tune the generative model f itself. I'm certain you can imagine the cost function to minimize. If you can't tell, well, that's why I'm here. The cost for the embedding-only approach is given by this equation, but now f is also being tweaked, so get rid of that hat over the theta, as it is no longer fixed. We are predicting both the embedding and theta in the process, and that's it.

Let's take a look at this in action. Here's the original voice: "Ask her to bring these things with her from the store." And here is a synthesized voice when using whole-model adaptation: "Both users have opened a massive investigation into allegations of fixing games and illegal betting." They sound pretty similar, right? Now we do the same for the male voice. Here's the original: "The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain." And here's the synthesized voice: "Instead of fixing it, they gave it a nickname."

Comparing the two methods, we see that whole-model adaptation has more degrees of freedom and hence more flexibility. However, it can easily overfit when applied to very little speaker data, so there's always a trade-off.

Until now we have just looked at the voice cloning phase using speaker adaptation. This is actually just one phase of the entire process. Now let us look at how we actually clone the voice, from the first step to the last. The training phase maps the speaker identity to some embedding. The text-audio-embedding triplet is then fed to the model for training. Initially, both the model parameters theta and the speaker embeddings e are initialized randomly. With supervised training samples, these parameters are gradually learned. After this phase, we have trained the multi-speaker model and we have also trained the multi-speaker speech embedding. In phase two we
have cloning, which we discussed with speaker adaptation. This involves fine-tuning either only the embedding, or both the embedding and the model, using the cloning samples. These cloning samples are collected by sampling the speaker's voice. After phase two we have, first, the trained multi-speaker model: if whole-model adaptation was used, then we've also catered it to a specific speaker, or it's just the same as the output of the first phase if we used embedding-only adaptation. And second, well, the speaker embedding is now catered to the current speaker. The third phase is audio generation. Given an input piece of text, the generative model is able to synthesize speech in the voice of the specific speaker, with the help of the embedding of course.

On to the next approach: speaker encoding. Now, this method doesn't really involve fine-tuning any model or embedding per se. The speaker encoding function g takes in a specific speaker's speech as input, that's A_{s_k}, and it outputs the corresponding speaker embedding e_{s_k}. Here A_{s_k} is the set of audio samples taken from the current speaker, that is, the voice to be cloned. This is represented as "cloned audios" in the figure.

Let us now try to determine the loss function for the speaker encoding approach. We have the original loss function, but now we have a speaker encoder g to generate the speech embeddings, so just substitute that in place of e. We are thus able to train the generative model and the speech encoder simultaneously.

However, in practice there are problems in training these generative models from scratch. The first is the missing modes problem, or mode collapse. Without enough training data, when sampling from generative models we may not be able to sample all classes very well. To give a concrete example, say you trained a generative model to output animals by showing it images of dogs, cats and giraffes, but we don't have enough giraffe images. Then every time we sample the generator, you only end up with a dog or
cat images, and we cannot sample giraffe images. One way to solve this problem is to get more training data, but if we do that, then what would be the point of this paper? We are trying to perform neural voice cloning with only a few samples, right? That was the objective. So the idea is to use a pre-trained generative model. Hence we have the model parameters, including the speech embeddings, learned for the multi-speaker model. The speech encoder is trained from scratch to make sure that we have a custom voice.

To train this, we first sample some speech from our pre-trained multi-speaker model f. This will generate an audio sample of speaker s_i. This cloned audio sample is then used as an input to the speaker encoder, labeled as "cloned audios". Since we have the speaker embedding for the speaker from the pre-trained multi-speaker model, we keep it fixed, hence it is indicated in blue. We compare this embedding to the one generated by the encoder and modify the encoder's parameters, theta subscript encoder. In other words, the speech encoder is trained.

So what is a good objective to minimize as this encoder training cost? A simple L1 loss seems to work best. Once again, the hats indicate the fixed values: e_{s_i} hat indicates the speaker embedding created for the speaker by the pre-trained multi-speaker model, and g is the speaker embedding predicted by the current speaker encoder. Eventually, the speaker encoder can generate appropriate speaker embeddings that are more catered to the individual.

Now, how do we synthesize the speech? First, text and some cloning audio samples are input to the model. Then the speaker encoder creates a speech embedding. This embedding, along with the text, is passed to the generator model. The audio corresponding to these inputs is sampled, and we get the required audio.

Now that we've talked about the speaker encoder model for voice cloning, it's on to the next topic: what exactly is the speaker encoder? Like, what does it consist of? The audio samples are converted into mel spectrograms. These are passed
into a prenet, which consists of fully connected (FC) layers with ELU activation. This is just for feature transformation. Next, the transformed features are passed through convolution blocks to extract temporal features. These conv blocks have residual connections, allowing deeper networks. Global average pooling summarizes the utterance. If you want to know exactly how residual connections work, and about various other convolutional neural network architectures, check out the "i" in the sky or the description down below.

Different audio samples have different amounts of information. Some of them are valuable, others are less so. A self-attention mechanism is used to determine the weights of the audio samples and get aggregated embeddings. This is kind of like soft attention, where we focus on the important parts. For more information on attention mechanisms and their types, I have a video for that too. The output is the predicted speaker embedding for the audio.

Let's take a look at some results here. Here's the original speaker's voice: "They had four children together." Now, after training a generator model and voice cloning using speaker encoding, here is a generated sample in the same voice: "Churches should not encourage it or make it look harmless." Let's listen to similar results for a male voice. First is the original speaker's voice: "It was even worse than at home." And here's the cloned voice using speaker encoding, saying something else: "Different telescope designs perform differently and have different strengths and weaknesses." That's pretty cool, if you ask me.

In this video we took a look at a paper released by Baidu on neural voice cloning with a few samples. The idea is to clone an unseen speaker's voice with only a few sound clips. The entire speech synthesis process involves three steps: the first is training the multi-speaker generative model and speaker embedding, the second is vocal cloning, and the third is synthesizing voice given text. Phase two, that is vocal cloning, is carried out using two approaches. The first that we
discussed is speaker adaptation, where we either fine-tune the embedding only, or fine-tune the generative model and the embedding, to cater to the speaker. The second approach was speaker encoding, where we trained a speaker encoder to accurately model speaker embeddings.

I encourage you to read the paper yourself to understand the extra details. The link to it is down in the description below the video. I understand that many people are deterred by the complex math in these papers. However, I hope that my video helps make the paper more accessible and bridges the gap between complex math and concept. There are fascinating works published every week on this topic, and I'm here to make them more accessible.

If you like this content, hit that like button. If you want to watch similar content, hit that subscribe button, and hit the bell icon too. I'm trying out this new setup with the camera and the microphone, so just let me know how you like it in the comments down below. The links to it will also be in the description down below, so if you want to get yourself your own camera or your own microphone, it's all there. Still not satisfied? Click or tap one of the videos right there and it'll take you to another awesome video. I will see you in the next one. Bye!
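
Editor's note on the objectives: the transcript describes the loss functions only verbally, since the equations were shown on screen. The block below is a reconstruction from that spoken description, so the exact notation is an assumption. Here f is the generative model with parameters theta, e_{s_i} is the embedding of speaker s_i, (t_{i,j}, a_{i,j}) is a text-audio pair drawn from that speaker's dataset T_{s_i}, L is the divergence between generated and real audio, g is the speaker encoder with parameters Theta, A_{s_i} is the set of cloning audios, and a hat marks a quantity that is kept fixed.

```latex
% General multi-speaker training objective
\min_{\theta,\, e}\;
  \mathbb{E}_{s_i \sim S,\; (t_{i,j},\, a_{i,j}) \sim T_{s_i}}
  \Big[ L\big( f(t_{i,j}, s_i;\ \theta,\ e_{s_i}),\ a_{i,j} \big) \Big]

% Speaker adaptation, embedding only (theta is fixed, unseen speaker s_k)
\min_{e_{s_k}}\;
  \mathbb{E}_{(t_{k,j},\, a_{k,j}) \sim T_{s_k}}
  \Big[ L\big( f(t_{k,j}, s_k;\ \hat{\theta},\ e_{s_k}),\ a_{k,j} \big) \Big]

% Speaker adaptation, whole model (the hat on theta is dropped)
\min_{\theta,\, e_{s_k}}\;
  \mathbb{E}_{(t_{k,j},\, a_{k,j}) \sim T_{s_k}}
  \Big[ L\big( f(t_{k,j}, s_k;\ \theta,\ e_{s_k}),\ a_{k,j} \big) \Big]

% Speaker encoding: the encoder output replaces the embedding
\min_{\theta,\, \Theta}\;
  \mathbb{E}_{s_i \sim S,\; (t_{i,j},\, a_{i,j}) \sim T_{s_i}}
  \Big[ L\big( f(t_{i,j}, s_i;\ \theta,\ g(A_{s_i}; \Theta)),\ a_{i,j} \big) \Big]

% Separate encoder training against fixed embeddings (L1 loss)
\min_{\Theta}\;
  \mathbb{E}_{s_i \sim S}
  \Big[ \big| g(A_{s_i}; \Theta) - \hat{e}_{s_i} \big| \Big]
```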
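
Editor's sketch of the two speaker adaptation approaches: the difference between embedding-only and whole-model adaptation comes down to which parameters are allowed to move during fine-tuning. The toy example below is only an illustration of that idea and is not the paper's model: the scalar linear "generator" `f`, the squared-error loss, the learning rate and the three cloning samples are all hypothetical stand-ins.

```python
# Toy stand-in for the multi-speaker model f(text, speaker; theta, e).
# The scalar features, linear model and squared error are hypothetical
# simplifications chosen so the example runs with the standard library only.

def f(t, theta, e):
    """Toy 'generator': predicts one audio value from one text feature."""
    return theta * t + e

def mean_loss(samples, theta, e):
    """Mean squared-error loss over (text, audio) pairs."""
    return sum(0.5 * (f(t, theta, e) - a) ** 2 for t, a in samples) / len(samples)

def adapt(samples, theta, e, tune_model, lr=0.05, steps=2000):
    """Fine-tune on a new speaker's (text, audio) cloning samples.

    tune_model=False: embedding-only adaptation (theta stays fixed).
    tune_model=True:  whole-model adaptation (theta and e both move).
    """
    n = len(samples)
    for _ in range(steps):
        grad_theta = 0.0
        grad_e = 0.0
        for t, a in samples:
            err = f(t, theta, e) - a      # derivative of 0.5 * err**2 w.r.t. the prediction
            grad_theta += err * t / n
            grad_e += err / n
        e -= lr * grad_e
        if tune_model:
            theta -= lr * grad_theta
    return theta, e

# "Pre-trained" parameter theta_hat, and a few cloning samples from an
# unseen speaker whose true behaviour is a = 2.5 * t + 1.0.
theta_hat, e_init = 2.0, 0.0
cloning_samples = [(0.0, 1.0), (1.0, 3.5), (2.0, 6.0)]

theta_a, e_a = adapt(cloning_samples, theta_hat, e_init, tune_model=False)
theta_b, e_b = adapt(cloning_samples, theta_hat, e_init, tune_model=True)

loss_embedding_only = mean_loss(cloning_samples, theta_a, e_a)
loss_whole_model = mean_loss(cloning_samples, theta_b, e_b)
print(f"embedding-only: theta={theta_a:.3f} e={e_a:.3f} loss={loss_embedding_only:.4f}")
print(f"whole-model:    theta={theta_b:.3f} e={e_b:.3f} loss={loss_whole_model:.4f}")
```

In this sketch the whole-model run fits the three samples more closely because it has more free parameters, which is the same extra flexibility that makes overfitting a risk when the cloning data is very small.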
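
Editor's sketch of the speaker encoder's last two steps: per the transcript, attention weights decide how much each cloning audio contributes to the aggregated embedding, and the encoder is trained with an L1 loss against the fixed embedding from the pre-trained multi-speaker model. The example below assumes the per-sample embeddings and attention scores are already given (in the real system they come from learned layers), and all of the numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def aggregate(sample_embeddings, attention_scores):
    """Attention-weighted average of per-sample embeddings."""
    weights = softmax(attention_scores)
    dim = len(sample_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, sample_embeddings))
        for d in range(dim)
    ]

def l1_loss(predicted, target):
    """Mean absolute difference between predicted and target embeddings."""
    return sum(abs(p - q) for p, q in zip(predicted, target)) / len(target)

# Hypothetical 2-dimensional embeddings for three cloning audios.
sample_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_scores = [0.0, 0.0, 0.0]   # equal scores give equal weights

predicted = aggregate(sample_embeddings, attention_scores)
target = [0.5, 0.5]                  # stands in for the fixed embedding (e hat)
loss = l1_loss(predicted, target)
print(predicted, loss)
```

Training the encoder would then mean adjusting whatever produces the per-sample embeddings and scores so that this loss shrinks, while the target embedding stays fixed.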

Original Description

In this video, we take a look at a paper released by Baidu on Neural Voice Cloning with a few samples. The idea is to “clone” an unseen speaker’s voice with only a few sound clips. If you like the video, hit that like button. Ring the bell to stay notified of my videos on Machine Learning, Deep Learning, Data Sciences and AI.

Main paper: https://arxiv.org/abs/1802.06006
Audio demos: https://audiodemos.github.io/

MY EQUIPMENT (on a $350 budget)
Camera (GoPro Hero 5 Black + 32 GB Memory + Kit): https://goo.gl/V4542j
Microphone: https://goo.gl/BxBRcW
Pop filter: https://goo.gl/oQTQ8W

FOLLOW ME
https://www.quora.com/profile/Ajay-Halthor

This video introduces Neural Voice Cloning, a technique that can clone a speaker's voice from just a few sound clips, and walks through the two approaches proposed in the Baidu paper: speaker adaptation and speaker encoding.

Key Takeaways

Read the research paper on arXiv
Explore the audio demos on GitHub
Understand the basics of Neural Voice Cloning
Apply Machine Learning concepts to voice cloning
Implement a simple neural network for voice cloning

💡 Neural Voice Cloning can be achieved with just a few sound clips, making it a potentially powerful tool for personalized speech synthesis.
