Neural Voice Cloning
Key Takeaways
The video explains neural voice cloning, a system from Baidu's Silicon Valley AI Lab that clones an unseen speaker's voice from only a few audio clips, as described in the arXiv paper "Neural Voice Cloning with a Few Samples."
Full Transcript
Wouldn't it be cool to have an AI that listens to you speak for a few seconds and then is able to say different things in your voice? Well, it exists, and before getting into details, I'll show you exactly what it can do. Consider the original audio clip. This is how the speaker sounds: "The regional newspapers have outperformed the national titles." Listening to a few seconds of audio like this from a speaker, the AI will be able to take some text input and generate new audio saying that text, like this: "The large items are put into containers for disposal." So you can make an AI say anything you want, and in your voice. That's pretty slick, right?

I'm Ajay Halthor, and in this video we're going to take a look at how exactly such a neural voice cloning system works. Stay notified about my videos by clicking that subscribe button and hitting that bell icon. Now let's get to it.

Last month, researchers at Baidu's Silicon Valley AI Lab developed a neural voice cloning system. This system requires only a few samples of single-speaker audio to generate speech in the speaker's voice. I stress the phrase "a few samples" because until now, neural networks required copious amounts of data to train themselves to perform any type of task. They need such large amounts of training data because of the thousands or even millions of parameters they have to estimate.

The paper, "Neural Voice Cloning with a Few Samples," proposes two different methods to perform voice cloning. The first is speaker adaptation. This involves just tweaking or fine-tuning a pre-trained model, making it adapt to the current speaker. The second method is speaker encoding. There are no pre-trained models here; we train two models, the generator model and the speaker encoder model, simultaneously.

Before getting into the details of these processes, I'm going to have to explain a few concepts so that we're all on the same page. Let's start with something easy: voice cloning. Voice cloning involves
reproducing the voice of an unseen speaker. "It's a beautiful day, isn't it?" "I'm going to conquer the world whether you like it or not." Okay. [Music]

We perform voice cloning with only a few samples, and this is considered few-shot generative modeling of speech. In other words, we only require a few samples of a speaker's speech in order to clone it. Few-shot generative modeling is challenging because the model has to learn speaker characteristics from just a limited amount of data.

Let's now take a look at another term: generative models. These are distributions that can be sampled from, and such samples correspond to real data. For example, consider a generative model that models animal images. Sampling from it, I should be able to get the image of, say, a dog. Next time I sample from it, I may get the image of a cat. For the current problem of voice cloning, the generative model models speech, so sampling from this model gives some speech audio by some speaker, and every time I sample from this model I can get audio from a different speaker.

Now that you know that generative models generate data, I think you can guess what needs to be done to create a voice cloning AI: we need to train the generative model so that we can sample speech audio from it. This model is trained on multiple speakers with multiple accents. If you want to get into details, this paper uses the LibriSpeech dataset, which consists of about 2,500 speakers and 820 hours of data.

To train our model, you would assume we use text-audio pairs. However, speakers with different accents say the same sentence in different ways, so multiple spectrograms will be mapped to the same text. This leads to a less accurate generative model. However, when the model is additionally given information about the speaker, such as dialect, accent, or gender, it is able to model the differences between the speech and hence improve performance. This information about the speaker is called the speaker embedding. And so, to train a generative model, instead of just text
audio pairs, we require triplets of text, audio, and speaker embedding for every sample. Formally defined, speaker embeddings are low-dimensional continuous representations of speaker characteristics.

Now that we've got the basic terms out of the way, let's talk about designing an objective function. Like I mentioned before, we train our generative model to generate audio. We call our generative model f. In such parametric models, training refers to learning the parameters of the model; let's call these parameters theta. Remember, we also want to learn an embedding to distinguish between speaker characteristics like pitch, accent, and dialect. Learning it is equivalent to estimating a set of parameters; let's call the speaker embedding parameters for speaker s_i e subscript s_i, or e_si.

What exactly do we give the trained model to produce an output? We provide two things. The first is the text to say; here t_ij is the j-th word spoken by speaker i. The second is the identity of the speaker, that's s_i. This helps us model the parameters for the speaker embedding. Sampling from this f, we get some cloned audio of speaker s_i saying the word t_ij.

For training, we have a dataset for every speaker s_i consisting of some text t and the corresponding actual audio a_ij of the speaker saying the word. The idea is thus to minimize the divergence between the cloned audio sampled from f and the actual audio in the dataset for the same speaker. This is the loss for just one sample from a single speaker, and so we take the expected value of this loss over all speakers s and over all samples in the dataset T_si. This is done to learn the model parameters theta and the speaker embedding e. And this is the general objective function, with a written explanation of each term.

So here's a question: why do we take the expected value of the loss instead of computing the loss directly? This is because we don't know how tractable, or easy to compute, the loss function is. It can be, and usually is, a complex function, and hence it becomes more feasible to
determine the approximate value of the loss, and in math this approximation is given by the expected value.

Let us now take a look at the two methods for actually computing this loss and hence performing voice cloning. We start with speaker adaptation. Here is the idea: we have a pre-trained audio generator, and we just need to fine-tune it to produce the voice of some unseen speaker given some text. Even in speaker adaptation, there are two approaches to fine-tuning. The first is embedding-only adaptation and the second is whole-model adaptation.

In the embedding-only adaptation approach, the only thing we need to do is further train the embedding to cater to a new speaker. We don't need to touch the speech generative model. The new loss function can be obtained from the general one we derived. Since the generative model is pre-trained, there is no theta estimation. We only need some text and the corresponding audio samples spoken by the current speaker s_k. Note that s_k is an unseen speaker that the speech generative model f hasn't seen before. I put a hat on theta to indicate it's fixed here. Since the embedding doesn't have nearly as many parameters as the speech model, we don't require the new speaker to talk too much, as we don't need that much data to model his or her voice.

Let's take a look at some results of this approach. First, here's the original sample voice: "We also need a small plastic snake and a big toy frog for the kids." Now, using embedding-only adaptation, here's the synthesized voice: "Learn about setting up wireless network configuration." You can tell the voice is similar to the original speaker. Let's try something similar, but with a male voice this time. Here's the original speech: "Some have accepted it as a miracle without physical explanation." And here is the synthesized voice using embedding-only adaptation: "Feedback must be timely and accurate throughout the project." Not bad, right? The voices are nearly the same.

So now let's take a look at another speaker adaptation approach that
I mentioned that's used for voice cloning too, and that's whole-model adaptation. We have a pre-trained model, but not only do we fine-tune the speaker embedding, as in the embedding-only approach, we also fine-tune the generative model f itself. I'm certain you can imagine the cost function to minimize. If you can't tell, well, that's why I'm here. The cost for the embedding-only approach is given by this equation, but now f is also being tweaked, so get rid of that hat over the theta, as it is no longer fixed. We are learning both the embedding and theta in the process, and that's it.

Let's take a look at this in action. Here's the original voice: "Ask her to bring these things with her from the store." And here is the synthesized voice when using whole-model adaptation: "Both users have opened a massive investigation into allegations of fixing games and illegal betting." They sound pretty similar, right? Now we do the same for the male voice. Here's the original: "The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain." And here's the synthesized voice: "Instead of fixing it, they gave it a nickname."

Comparing the two methods, we see that whole-model adaptation has more degrees of freedom and hence more flexibility. However, it can easily overfit when applied to very little speaker data, so there's always a trade-off.

Until now, we have just looked at the voice cloning phase using speaker adaptation. This is actually just one phase of the entire process. Now let us look at how we actually clone the voice from the first step to the last. The training part maps the speaker identity to some embedding. The text-audio-embedding triplet is then fed to the model for training. Initially, both the model parameters theta and the speaker embeddings e are initialized randomly. With supervised training samples, these parameters are gradually learned. After this phase, we have the trained multi-speaker model and the trained multi-speaker speech embeddings. In phase two we
have cloning, which we discussed with speaker adaptation. This involves fine-tuning either only the embedding or both the embedding and the model using the cloning samples. These cloning samples are collected by sampling the speaker's voice. After phase two, we have two things. The first is the trained multi-speaker model: if whole-model adaptation was used, then we've also catered it to the specific speaker; if we used embedding-only adaptation, it's just the same as the output of the first phase. The second is the speaker embedding, which is now catered to the current speaker. The third phase is audio generation. Given an input piece of text, the generative model is able to synthesize speech in the voice of the specific speaker, with the help of the embedding, of course.

On to the next approach: speaker encoding. This method doesn't really involve fine-tuning any model or embedding per se. The speaker encoding function g takes in a specific speaker's speech as input, that's A subscript s_k, and it outputs the corresponding speaker embedding, e subscript s_k. Here A_sk is the set of audio samples taken from the current speaker, that is, the voice to be cloned. This is represented as "cloned audios" in the figure.

Let us now try to determine the loss function for the speaker encoding approach. We have the original loss function, but now we have a speaker encoder g to generate the speech embeddings, so just substitute that in place of e. We are thus able to train the generative model and the speech encoder simultaneously.

However, in practice there are problems in training these generative models from scratch. The first is the missing modes problem, or mode collapse. Without enough training data, when sampling generative models, we may not be able to sample all classes very well. To give a concrete example, say you trained a generative model to output animals by showing it images of dogs, cats, and giraffes, but we don't have enough giraffe images. Every time we sample the generator, we only end up with a dog or
cat image, and we cannot sample giraffe images. One way to solve this problem is to get more training data, but if we do that, then what would be the point of this paper? We are trying to perform neural voice cloning with only a few samples, right? That was the objective. So the idea is to use a pre-trained generative model. Hence we have the model parameters, including the speech embeddings, learned for the multi-speaker model. The speech encoder is trained from scratch to make sure that we have a custom voice.

To train this, we first sample some speech from our pre-trained multi-speaker model f. This will generate an audio sample of speaker s_i. This cloned audio sample is then used as an input to the speaker encoder, labeled as "cloned audios." Since we have the speaker embedding for the speaker from the pre-trained multi-speaker model, we keep it fixed, hence it is indicated in blue. We compare this embedding to the one generated by the encoder and modify the encoder's parameters, theta subscript encoder. In other words, the speech encoder is trained.

So what is a good objective to minimize for this encoder training? A simple L1 loss seems to work best. Once again, the hats indicate fixed values: e_si hat is the speaker embedding created for the speaker by the pre-trained multi-speaker model, and g is the speaker embedding predicted by the current speaker encoder. Eventually, the speaker encoder can generate appropriate speaker embeddings that are more catered to the individual.

Now, how do we synthesize the speech? First, text and some cloning audio samples are input to the model. Then the speaker encoder creates a speech embedding. This embedding, along with the text, is passed to the generator model. The audio corresponding to these inputs is sampled, and we get the required audio.

Now that we've talked about the speaker encoder model for voice cloning, it's on to the next topic: what exactly is the speaker encoder? What does it consist of? Audio samples are converted into mel spectrograms. These are passed
into a prenet, which consists of fully connected layers with an ELU activation. This is just for feature transformation. Next, the transformed features are passed through convolution blocks to extract temporal features. These conv blocks have residual connections, allowing deeper networks. Global average pooling summarizes the utterance. If you want to know exactly how residual connections work, along with various other convolutional neural network architectures, check out the "i" up top or the description down below.

Different audio samples have different amounts of information. Some of them are valuable; others are less so. A self-attention mechanism is used to determine the weights of the audio samples and get an aggregate embedding. This is kind of like soft attention, where we focus on the important parts. For more information on attention mechanisms and their types, I have a video for that too. The output is the predicted speaker embedding for the audio.

Let's take a look at some results. Here's the original speaker's voice: "They had four children together." Now, after training a generator model and voice cloning using speaker encoding, here is a generated sample in the same voice: "Churches should not encourage it or make it look harmless." Let's listen to similar results for a male voice. First is the original speaker's voice: "It was even worse than at home." And here's the cloned voice using speaker encoding, saying something else: "Different telescope designs perform differently and have different strengths and weaknesses." That's pretty cool, if you ask me.

In this video, we took a look at a paper released by Baidu on neural voice cloning with a few samples. The idea is to clone an unseen speaker's voice with only a few sound clips. The entire speech synthesis process involves three steps: first is training the multi-speaker generative model and speaker embedding, the second is voice cloning, and the third is synthesizing voice given text. Phase two, voice cloning, is carried out using two approaches. The first that we
discussed is speaker adaptation, where we either fine-tune only the embedding or fine-tune both the generative model and the embedding to cater to the speaker. The second approach was speaker encoding, where we trained a speaker encoder to accurately model speaker embeddings.

I encourage you to read the paper yourself to understand the extra details. The link to it is down in the description below the video. I understand that many people are deterred by the complex math in these papers. However, I hope that my video helps make the paper more accessible and that it bridges the gap between complex math and concept. There are fascinating works published every week on this topic, and I'm here to make them more accessible.

If you like this content, hit that like button. If you want to watch similar content, hit that subscribe button, and hit the bell icon too. I'm trying out this new setup with the camera and the microphone, so just let me know how you like it in the comments down below. The links to it will also be in the description down below, so if you want to get yourself your own camera or your own microphone, it's all there. Still not satisfied? Click or tap one of the videos right there and it'll take you to another awesome video, and I will see you in the next one. Bye.
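Note on the equations: the formulas shown on screen in the video are not captured in the transcript. The block below is a reconstruction of the objectives the narration describes, not a copy of the slides. It follows the narration's symbols (f for the generative model, theta for its parameters, e for speaker embeddings, g for the speaker encoder with parameters theta_encoder, hats for values held fixed). The generic loss L, the speaker set S, and the exact index layout are assumptions made for illustration.

```latex
\begin{align*}
% General multi-speaker objective: learn theta and all speaker embeddings e
&\min_{\theta,\, e}\;
  \mathbb{E}_{s_i \sim S,\ (t_{i,j},\, a_{i,j}) \sim T_{s_i}}
  \Big[ L\big( f(t_{i,j}, s_i;\, \theta, e_{s_i}),\; a_{i,j} \big) \Big] \\[6pt]
% Embedding-only adaptation: theta is fixed (hat), only e_{s_k} is learned
&\min_{e_{s_k}}\;
  \mathbb{E}_{(t_{k,j},\, a_{k,j}) \sim T_{s_k}}
  \Big[ L\big( f(t_{k,j}, s_k;\, \hat{\theta}, e_{s_k}),\; a_{k,j} \big) \Big] \\[6pt]
% Whole-model adaptation: the hat is dropped, theta and e_{s_k} are both learned
&\min_{\theta,\, e_{s_k}}\;
  \mathbb{E}_{(t_{k,j},\, a_{k,j}) \sim T_{s_k}}
  \Big[ L\big( f(t_{k,j}, s_k;\, \theta, e_{s_k}),\; a_{k,j} \big) \Big] \\[6pt]
% Speaker encoding, joint training: e_{s_i} is replaced by g(A_{s_i})
&\min_{\theta,\, \theta_{\text{encoder}}}\;
  \mathbb{E}_{s_i \sim S,\ (t_{i,j},\, a_{i,j}) \sim T_{s_i}}
  \Big[ L\big( f(t_{i,j}, s_i;\, \theta, g(A_{s_i};\, \theta_{\text{encoder}})),\; a_{i,j} \big) \Big] \\[6pt]
% Speaker encoder trained separately with an L1 loss against fixed embeddings
&\min_{\theta_{\text{encoder}}}\;
  \mathbb{E}_{s_i \sim S}
  \Big[ \big|\, g(A_{s_i};\, \theta_{\text{encoder}}) - \hat{e}_{s_i} \,\big| \Big]
\end{align*}
```

Read top to bottom, the only things that change are which parameters sit under the min and which ones carry a hat, which matches how the narration derives each objective from the general one.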
Original Description
In this video, we take a look at a paper released by Baidu on Neural Voice Cloning with a few samples. The idea is to “clone” an unseen speaker’s voice with only a few sound clips.
If you like the video, hit that like button. Ring the bell to stay notified of my videos on Machine Learning, Deep Learning, Data Sciences and AI.
main paper: https://arxiv.org/abs/1802.06006
Check out the audio demos: https://audiodemos.github.io/
MY EQUIPMENT (on a $350 budget)
Camera (GoPro Hero 5 Black + 32 GB Memory + Kit): https://goo.gl/V4542j
Microphone: https://goo.gl/BxBRcW
Pop filter: https://goo.gl/oQTQ8W
FOLLOW ME
https://www.quora.com/profile/Ajay-Halthor