BERT Neural Network - EXPLAINED!
Skills:
Unsupervised Learning 90% · Fine-tuning LLMs 80% · ML Maths Basics 70% · Supervised Learning 60% · ML Pipelines 50%
Key Takeaways
The video explains the BERT neural network (Bidirectional Encoder Representations from Transformers) and how it is applied to natural language processing tasks through a pre-training phase followed by a fine-tuning phase. It covers how BERT learns language by training on two unsupervised tasks, masked language modeling and next sentence prediction, and how it is then fine-tuned for specific tasks such as language translation, question answering, and sentiment analysis.
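To make the two pre-training tasks concrete, here is a minimal sketch of how one training example might be built. It is an illustration only, not the procedure from the BERT paper or the video: tokens are split on whitespace rather than with WordPiece, the `[MASK]` placeholder and the helper names `mask_tokens` and `make_pretraining_example` are hypothetical, and the default masking rate of 0.15 is an assumption that the video does not state.

```python
import random

MASK = "[MASK]"  # placeholder token name, assumed for illustration


def mask_tokens(tokens, mask_prob, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and a dict {position: original token}
    holding the answers the model would have to fill in.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def make_pretraining_example(sent_a, sent_b, follows, mask_prob=0.15, seed=0):
    """Build one toy example that serves both pre-training tasks.

    `follows` is the next sentence prediction label: 1 if sent_b really
    comes after sent_a, 0 otherwise. Whitespace splitting stands in for
    a real tokenizer (an assumption of this sketch).
    """
    rng = random.Random(seed)
    tokens = sent_a.split() + sent_b.split()
    masked, targets = mask_tokens(tokens, mask_prob, rng)
    return {"input": masked, "mlm_targets": targets, "nsp_label": follows}


example = make_pretraining_example(
    "the cat sat on the mat", "it fell asleep there", follows=1, mask_prob=0.3
)
print(example["input"])        # the sentence pair with some words masked
print(example["mlm_targets"])  # the fill-in-the-blank answers by position
print(example["nsp_label"])    # the binary next-sentence label
```

The masked positions give the fill-in-the-blanks targets, and the single label gives the binary next-sentence target, so one input can train both tasks at once.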
Full Transcript
Today we're gonna talk about BERT, so let's jump into it. This is the Transformer neural network architecture, which was initially created to solve the problem of language translation. This was very well received. Until this point, LSTM networks had been used to solve this problem, but they had a few problems themselves. LSTM networks are slow to train: words are passed in sequentially and are generated sequentially, so it can take a significant number of time steps for the neural net to learn. And they're not really the best at capturing the true meaning of words, yes, even bidirectional LSTMs, because even here they are technically learning left-to-right and right-to-left context separately and then concatenating them, so the true context is slightly lost.

But the Transformer architecture addresses some of these concerns. First, they are faster, as words can be processed simultaneously. Second, the context of words is better learned, as they can learn context from both directions simultaneously.

So for now, let's see the Transformer in action. Say we want to train this architecture to convert English to French. The Transformer consists of two key components: an encoder and a decoder. The encoder takes the English words simultaneously and it generates embeddings for every word simultaneously. These embeddings are vectors that encapsulate the meaning of the word; similar words have closer numbers in their vectors. The decoder takes these embeddings from the encoder and the previously generated words of the translated French sentence, and then it uses them to generate the next French word. And we keep generating the French translation one word at a time until the end of sentence is reached.

What makes this conceptually so much more appealing than some LSTM cell is that we can physically see a separation in tasks. The encoder learns: what is English, what is grammar, and more importantly, what is context? The decoder learns: how do English words relate to French words? Both of these, even separately, have
some underlying understanding of language, and it's because of this understanding that we can pick apart this architecture and build systems that understand language. We stack the decoders and we get the GPT Transformer architecture. Conversely, if we stack just the encoders, we get BERT: a Bidirectional Encoder Representation from Transformers, which is exactly what it is.

The OG Transformer has language translation on lock, but we can use BERT to learn language translation, question answering, sentiment analysis, text summarization, and many more tasks. Turns out all of these problems require the understanding of language, so we can train BERT to understand language and then fine-tune BERT depending on the problem we want to solve. As such, the training of BERT is done in two phases. The first phase is pre-training, where the model understands: what is language and context? And the second phase is fine-tuning, where the model learns: I know language, but how do I solve this problem? From here we'll go through pre-training and fine-tuning, starting at the highest level and then delving further and further into details after every pass. So let's go deeper into each phase.

So, pre-training. The goal of pre-training is to make BERT learn what is language and what is context. BERT learns language by training on two unsupervised tasks simultaneously: they are masked language modeling and next sentence prediction. For masked language modeling, BERT takes in a sentence with random words replaced with masks. The goal is to output these masked tokens, and this is kind of like fill in the blanks. It helps BERT understand bidirectional context within a sentence. In the case of next sentence prediction, BERT takes in two sentences and it determines if the second sentence actually follows the first, in what is kind of like a binary classification problem. This helps BERT understand context across different sentences themselves, and using both of these together, BERT gets a good understanding of language. Great, so that's
pre-training. Now the fine-tuning phase. So we can now further train BERT on very specific NLP tasks. For example, let's take question answering. All we need to do is replace the fully connected output layers of the network with a fresh set of output layers that can basically output the answer to the question we want. Then we can perform supervised training using a question answering dataset. It won't take long, since it's only the output parameters that are learned from scratch; the rest of the model parameters are just slightly fine-tuned, and as a result training time is fast. And we can do this for any NLP problem: that is, replace the output layers and then train with a specific dataset.

Okay, so that's pass one of the explanation on pre-training and fine-tuning. Let's go on to pass two with some more details. During BERT pre-training, we train on masked language modeling and next sentence prediction. In practice, both of these problems are trained simultaneously. The input is a set of two sentences with some of the words being masked. Each token is a word, and we convert each of these words into embeddings using pre-trained embeddings. This provides a good starting point for BERT to work with. Now on the output side, C is the binary output for the next sentence prediction, so it would output 1 if sentence B follows sentence A in context and 0 if sentence B doesn't follow sentence A. Each of the T's here are word vectors that correspond to the outputs for the language model problem, so the number of word vectors that we input is the same as the number of word vectors that we output.

Now on the fine-tuning phase, though, if we wanted to perform question answering, we would train the model by modifying the inputs and the output layer. We pass in the question followed by a passage containing the answer as inputs, and in the output layer we would output the start and the end words that encapsulate the answer, assuming that the answer is within the same span of text. Now that's pass two of the
explanation. Now for pass three, where we dive further into details. This is going to be fun.

On the input side, how are we going to generate these embeddings from the word token inputs? Well, the initial embedding is constructed from three vectors. The token embeddings are the pre-trained embeddings; the main paper uses WordPiece embeddings that have a vocabulary of 30,000 tokens. The segment embeddings is basically the sentence number that is encoded into a vector, and the position embeddings is the position of a word within that sentence that is encoded into a vector. Adding these three vectors together, we get an embedding vector that we use as input to BERT. The segment and position embeddings are required for temporal ordering, since all these vectors are fed in simultaneously into BERT and language models need this ordering preserved. Cool, the input is starting to piece together pretty well.

Let's go to the output side now. The output is a binary value C and a bunch of word vectors, but with training we need to minimize a loss. So two key details to note here: all of these word vectors have the same size, and all of these word vectors are generated simultaneously. We need to take each word vector and pass it into a fully connected output layer with the number of neurons equal to the number of tokens in the vocabulary, so that would be an output layer corresponding to 30,000 neurons in this case, and we would apply a softmax activation. This way we would convert a word vector to a distribution, and the actual label of this distribution would be a one-hot encoded vector for the actual word. And so we compare these two distributions and then train the network using the cross-entropy loss. But note that the output has all the words, even though those inputs weren't masked at all. The loss, though, only considers the prediction of the masked words, and it ignores all the other words that are output by the network. This is done to ensure that more focus is given to predicting these masked values
so that it gets them correct, and it increases context awareness. So that was the three passes of explaining the pre-training and fine-tuning of BERT.

So let's put this all together. We pre-train BERT with masked language modeling and next sentence prediction. For every word, we get the token embedding from the pre-trained WordPiece embeddings and add the position and segment embeddings to account for the ordering of the inputs. These are then passed into BERT, which under the hood is a stack of Transformer encoders, and it outputs a bunch of word vectors for masked language modeling and a binary value for next sentence prediction. The word vectors are then converted into a distribution to train using cross-entropy loss. Once training is complete, BERT has some notion of language; it's a language model.

The next step is the fine-tuning phase, where we perform supervised training depending on the task we want to solve, and this should happen fast. In fact, BERT SQuAD, that is, the Stanford question-and-answer model, only takes about 30 minutes to fine-tune from a language model for a 91% performance. Of course, performance depends on how big we want BERT to be: the BERT large model, which has 340 million parameters, can achieve way higher accuracies than the BERT base model, which only has 110 million parameters.

There's so much more to address about the internals of BERT that I could go on forever, but for now I hope this explanation was good to get you an idea of what BERT really does under the hood. For more details on the Transformer neural network architecture, which is the foundation of BERT itself, click on this video. Subscribe and stay safe. A lot more content coming your way soon, and I'll see you soon. Bye-bye.
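The input and output steps from the transcript can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not BERT's actual implementation: the function names `input_embedding`, `softmax`, and `masked_lm_loss` are hypothetical, the vectors are tiny hand-written lists rather than learned parameters, and the vocabulary has 4 entries instead of 30,000. It shows the element-wise sum of token, segment, and position embeddings, and a cross-entropy loss that is averaged over the masked positions only.

```python
import math


def input_embedding(token_vec, segment_vec, position_vec):
    """Element-wise sum of the three embedding vectors for one token."""
    return [t + s + p for t, s, p in zip(token_vec, segment_vec, position_vec)]


def softmax(logits):
    """Turn raw output-layer scores into a distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_lm_loss(logits_per_position, target_ids, masked_positions):
    """Average cross-entropy over the masked positions only.

    With a one-hot label, cross-entropy reduces to -log(p[target]).
    Unmasked positions still produce outputs, but they are ignored here.
    """
    if not masked_positions:
        return 0.0  # nothing was masked, so there is nothing to score
    losses = []
    for pos in masked_positions:
        probs = softmax(logits_per_position[pos])
        losses.append(-math.log(probs[target_ids[pos]]))
    return sum(losses) / len(losses)


# Toy numbers (assumed): 3 positions, vocabulary of 4 tokens.
logits = [
    [2.0, 0.1, 0.1, 0.1],  # position 0, not masked
    [0.0, 0.0, 0.0, 0.0],  # position 1, masked, model is undecided
    [0.1, 0.1, 5.0, 0.1],  # position 2, not masked
]
targets = [0, 3, 2]  # index of the true word at each position

print(input_embedding([1.0, 2.0], [0.5, 0.5], [0.25, 0.0]))
print(masked_lm_loss(logits, targets, masked_positions=[1]))
```

Because only position 1 is masked, changing the scores at positions 0 and 2 leaves the loss unchanged, which is the "ignores all the other words" behavior described above.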
Original Description
Understand the BERT Transformer in and out.
Follow me on M E D I U M: https://towardsdatascience.com/likelihood-probability-and-the-math-you-should-know-9bf66db5241b
Please subscribe to keep me alive: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1
PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know : https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i384100.net/python
📕 MLOps Course: https://imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): https://imp.i384100.net/NLP
📕 Machine Learning in Production: https://imp.i384100.net/MLProduction
📕 Data Science Specialization: https://imp.i384100.net/DataScience
📕 Tensorflow: https://imp.i384100.net/Tensorflow
REFERENC
Playlist
Uploads from CodeEmporium · CodeEmporium · 47 of 60