BERT Neural Network - EXPLAINED!

CodeEmporium · Advanced · 📐 ML Fundamentals · 6y ago

Skills: Unsupervised Learning 90% · Fine-tuning LLMs 80% · ML Maths Basics 70% · Supervised Learning 60% · ML Pipelines 50%

Key Takeaways

The video explains BERT (Bidirectional Encoder Representations from Transformers), a neural network built from the encoder of the transformer architecture, and how it is applied to natural language processing tasks through a pre-training phase followed by a fine-tuning phase. It covers how BERT learns language by training on two unsupervised tasks, masked language modeling and next sentence prediction, and how it is then fine-tuned for specific tasks such as language translation, question answering, and sentiment analysis.
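
As a rough illustration of the masked language modeling task mentioned above, the following Python sketch hides random tokens behind a `[MASK]` placeholder and keeps the originals as labels. The function name `mask_tokens`, the toy sentence, and the `mask_prob` parameter are assumptions made for this example and do not come from the video. The default rate of 15% follows the BERT paper, whose actual masking procedure is somewhat more involved than this simplified version.

```python
import random

MASK = "[MASK]"  # placeholder token used for hidden words

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original token at every masked position and None
    elsewhere, so only the hidden words act as prediction targets.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the word from the model
            labels.append(tok)    # remember what it should predict
        else:
            masked.append(tok)    # leave the word visible
            labels.append(None)   # no prediction target here
    return masked, labels

# Toy sentence; a higher mask_prob is used so that a short input
# is likely to contain at least one mask.
sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Positions whose label is `None` were left untouched, which mirrors the "fill in the blanks" framing: the model sees the surrounding words on both sides and is asked only for the hidden ones.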

Full Transcript

Today we're gonna talk about BERT, so let's jump into it.

This is the transformer neural network architecture that was initially created to solve the problem of language translation. This was very well received. Until this point, LSTM networks had been used to solve this problem, but they had a few problems themselves. LSTM networks are slow to train: words are passed in sequentially and are generated sequentially, so it can take a significant number of time steps for the neural net to learn. And it's not really the best at capturing the true meaning of words, yes, even bi-directional LSTMs, because even here they are technically learning left-to-right and right-to-left context separately and then concatenating them, so the true context is slightly lost.

But the transformer architecture addresses some of these concerns. First, they are faster, as words can be processed simultaneously. Second, the context of words is better learned, as they can learn context from both directions simultaneously.

So for now, let's see the transformer in action. Say we want to train this architecture to convert English to French. The transformer consists of two key components: an encoder and a decoder. The encoder takes the English words simultaneously and it generates embeddings for every word simultaneously. These embeddings are vectors that encapsulate the meaning of the word; similar words have closer numbers in their vectors. The decoder takes these embeddings from the encoder and the previously generated words of the translated French sentence, and then it uses them to generate the next French word. And we keep generating the French translation one word at a time until the end of sentence is reached.

What makes this conceptually so much more appealing than some LSTM cell is that we can physically see a separation in tasks. The encoder learns: what is English, what is grammar, and more importantly, what is context? The decoder learns: how do English words relate to French words? Both of these, even separately, have
some underlying understanding of language, and it's because of this understanding that we can pick apart this architecture and build systems that understand language. We stack the decoders and we get the GPT transformer architecture. Conversely, if we stack just the encoders, we get BERT, a Bidirectional Encoder Representation from Transformers, which is exactly what it is.

The OG transformer has language translation on lock, but we can use BERT to learn language translation, question answering, sentiment analysis, text summarization, and many more tasks. Turns out all of these problems require the understanding of language, so we can train BERT to understand language and then fine-tune BERT depending on the problem we want to solve. As such, the training of BERT is done in two phases. The first phase is pre-training, where the model understands what is language and context, and the second phase is fine-tuning, where the model learns, "I know language, but how do I solve this problem?"

From here we'll go through pre-training and fine-tuning, starting at the highest level and then delving further and further into details after every pass. So let's go deeper into each phase.

So, pre-training. The goal of pre-training is to make BERT learn what is language and what is context. BERT learns language by training on two unsupervised tasks simultaneously: masked language modeling and next sentence prediction. For masked language modeling, BERT takes in a sentence with random words filled with masks. The goal is to output these masked tokens, and this is kind of like fill in the blanks. It helps BERT understand a bi-directional context within a sentence. In the case of next sentence prediction, BERT takes in two sentences and it determines if the second sentence actually follows the first, in kind of what is like a binary classification problem. This helps BERT understand context across different sentences themselves. And using both of these together, BERT gets a good understanding of language. Great, so that's
pre-training. Now, the fine-tuning phase.

So we can now further train BERT on very specific NLP tasks. For example, let's take question answering. All we need to do is replace the fully connected output layers of the network with a fresh set of output layers that can basically output the answer to the question we want. Then we can perform supervised training using a question answering dataset. It won't take long, since it's only the output parameters that are learned from scratch; the rest of the model parameters are just slightly fine-tuned, and as a result training time is fast. And we can do this for any NLP problem: that is, replace the output layers and then train with a specific dataset.

Okay, so that's pass one of the explanation on pre-training and fine-tuning. Let's go on to pass two with some more details.

During BERT pre-training, we train on masked language modeling and next sentence prediction. In practice, both of these problems are trained simultaneously. The input is a set of two sentences with some of the words being masked. Each token is a word, and we convert each of these words into embeddings using pre-trained embeddings. This provides a good starting point for BERT to work with. Now on the output side, C is the binary output for the next sentence prediction, so it would output 1 if sentence B follows sentence A in context and 0 if sentence B doesn't follow sentence A. Each of the T's here are word vectors that correspond to the outputs for the language model problem, so the number of word vectors that we input is the same as the number of word vectors that we output.

Now on the fine-tuning phase, though, if we wanted to perform question answering, we would train the model by modifying the inputs and the output layer. We pass in the question followed by a passage containing the answer as inputs, and in the output layer we would output the start and the end words that encapsulate the answer, assuming that the answer is within the same span of text. Now that's pass two of the
explanation. Now for pass three, where we dive further into details. This is going to be fun.

On the input side, how are we going to generate these embeddings from the word token inputs? Well, the initial embedding is constructed from three vectors. The token embeddings are the pre-trained embeddings; the main paper uses WordPiece embeddings that have a vocabulary of 30,000 tokens. The segment embeddings is basically the sentence number that is encoded into a vector, and the position embeddings is the position of a word within that sentence that is encoded into a vector. Adding these three vectors together, we get an embedding vector that we use as input to BERT. The segment and position embeddings are required for temporal ordering, since all these vectors are fed in simultaneously into BERT and language models need this ordering preserved. Cool, the input is starting to piece together pretty well.

Let's go to the output side now. The output is a binary value C and a bunch of word vectors, but with training we need to minimize a loss. So two key details to note here: all of these word vectors have the same size, and all of these word vectors are generated simultaneously. We need to take each word vector and pass it into a fully connected output layer with the number of neurons equal to the number of tokens in the vocabulary, so that would be an output layer corresponding to 30,000 neurons in this case, and we would apply a softmax activation. This way we would convert a word vector to a distribution, and the actual label of this distribution would be a one-hot encoded vector for the actual word. And so we compare these two distributions and then train the network using the cross-entropy loss. But note that the output has all the words, even though those inputs weren't masked at all. The loss, though, only considers the prediction of the masked words, and it ignores all the other words that are output by the network. This is done to ensure that more focus is given to predicting these masked values
so that it gets them correct, and it increases context awareness.

So that was three passes of explaining the pre-training and fine-tuning of BERT, so let's put this all together. We pre-train BERT with masked language modeling and next sentence prediction. For every word, we get the token embedding from the pre-trained WordPiece embeddings and add the position and segment embeddings to account for the ordering of the inputs. These are then passed into BERT, which under the hood is a stack of transformer encoders, and it outputs a bunch of word vectors for masked language modeling and a binary value for next sentence prediction. The word vectors are then converted into a distribution to train using cross-entropy loss. Once training is complete, BERT has some notion of language; it's a language model.

The next step is the fine-tuning phase, where we perform a supervised training depending on the task we want to solve, and this should happen fast. In fact, the BERT SQuAD, that is, the Stanford question-and-answer model, only takes about 30 minutes to fine-tune from a language model for a 91% performance. Of course, performance depends on how big we want BERT to be. Now the BERT large model, which has 340 million parameters, can achieve way higher accuracies than the BERT base model, which only has 110 million parameters.

There's so much more to address about the internals of BERT that I could go on forever, but for now I hope this explanation was good to get you an idea of what BERT really does under the hood. For more details on the transformer neural network architecture, which is the foundation of BERT itself, click on this video. Subscribe and stay safe. A lot more content coming your way soon, and I'll see you soon. Bye-bye!
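
The loss described in pass three of the transcript (softmax over the vocabulary, cross-entropy against the true word, counted only at masked positions) can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the reference implementation: the names `softmax` and `masked_lm_loss`, the four-token toy vocabulary, and the made-up scores are all hypothetical, and the averaging over masked positions is a choice made for this sketch.

```python
import math

def softmax(logits):
    """Turn raw output-layer scores into a probability distribution."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_lm_loss(logits_per_position, target_ids, is_masked):
    """Average cross-entropy over the masked positions only."""
    losses = []
    for logits, target, masked in zip(logits_per_position, target_ids, is_masked):
        if not masked:
            continue                      # unmasked outputs are ignored by the loss
        probs = softmax(logits)
        # With a one-hot label, cross-entropy reduces to -log p(true word).
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses) if losses else 0.0

# Toy setup: vocabulary of 4 tokens, sequence of 3 positions,
# and only the middle position was masked in the input.
logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 3.0, 0.1],
]
targets = [0, 2, 2]
is_masked = [False, True, False]
print(masked_lm_loss(logits, targets, is_masked))
```

Because the masked position has uniform scores over four tokens, the loss here equals the natural log of 4; changing the scores at the two unmasked positions leaves it unchanged, which is the behaviour the transcript describes.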

Original Description

Understand the BERT Transformer in and out.

Follow me on M E D I U M: https://towardsdatascience.com/likelihood-probability-and-the-math-you-should-know-9bf66db5241b
Please subscribe to keep me alive: https://www.youtube.com/c/CodeEmporium?sub_confirmation=1

PLAYLISTS FROM MY CHANNEL
⭕ Reinforcement Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9kS--NgVz0EPNyEmygV1Ha&si=AuThDZJwG19cgTA8
Natural Language Processing: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE&si=LsVy8RDPu8jeO-cc
⭕ Transformers from Scratch: https://youtube.com/playlist?list=PLTl9hO2Oobd_bzXUpzKMKA3liq2kj6LfE
⭕ ChatGPT Playlist: https://youtube.com/playlist?list=PLTl9hO2Oobd9coYT6XsTraTBo4pL1j4HJ
⭕ Convolutional Neural Networks: https://youtube.com/playlist?list=PLTl9hO2Oobd9U0XHz62Lw6EgIMkQpfz74
⭕ The Math You Should Know: https://youtube.com/playlist?list=PLTl9hO2Oobd-_5sGLnbgE8Poer1Xjzz4h
⭕ Probability Theory for Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd9bPcq0fj91Jgk_-h1H_W3V
⭕ Coding Machine Learning: https://youtube.com/playlist?list=PLTl9hO2Oobd82vcsOnvCNzxrZOlrz3RiD

MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: https://imp.i384100.net/MathML
📕 Calculus: https://imp.i384100.net/Calculus
📕 Statistics for Data Science: https://imp.i384100.net/AdvancedStatistics
📕 Bayesian Statistics: https://imp.i384100.net/BayesianStatistics
📕 Linear Algebra: https://imp.i384100.net/LinearAlgebra
📕 Probability: https://imp.i384100.net/Probability

OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: https://imp.i384100.net/Deep-Learning
📕 Python for Everybody: https://imp.i384100.net/python
📕 MLOps Course: https://imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): https://imp.i384100.net/NLP
📕 Machine Learning in Production: https://imp.i384100.net/MLProduction
📕 Data Science Specialization: https://imp.i384100.net/DataScience
📕 Tensorflow: https://imp.i384100.net/Tensorflow

REFERENC

This video provides an in-depth explanation of the BERT neural network, its architecture, and its application in NLP tasks. Viewers will learn how BERT learns language through masked language modeling and next sentence prediction, and how it can be fine-tuned for specific tasks. The video covers the mathematical foundations of BERT and its practical applications in NLP.
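
To make the input side of the architecture concrete, here is a small sketch of how the three embeddings described in the video (token, segment, and position) could be added together for each input position. Everything specific in it is an assumption for illustration: the helper names `make_table` and `embed`, the tiny table sizes, the random values standing in for learned weights, and the made-up token ids. Positions are counted across the whole two-sentence input in this sketch.

```python
import random

def make_table(num_rows, dim, rng):
    """Create a toy embedding table with small random values (a stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum token, segment, and position vectors element-wise for every input position."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = zip(token_table[tok], segment_table[seg], position_table[position])
        vectors.append([t + s + p for t, s, p in rows])
    return vectors

rng = random.Random(0)
dim = 4                                   # toy size; real models use hundreds of dimensions
token_table = make_table(10, dim, rng)    # hypothetical 10-token vocabulary
segment_table = make_table(2, dim, rng)   # sentence A = 0, sentence B = 1
position_table = make_table(8, dim, rng)  # hypothetical maximum of 8 positions

token_ids = [1, 5, 7, 2, 3]      # made-up ids for a two-sentence input
segment_ids = [0, 0, 0, 1, 1]    # first three tokens belong to A, the rest to B
vectors = embed(token_ids, segment_ids, token_table, segment_table, position_table)
print(len(vectors), len(vectors[0]))
```

The element-wise sum only works because all three tables share the same vector size, and it yields one input vector per token, so the same word gets a different input vector depending on where it appears and which sentence it belongs to.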

Key Takeaways

Understand the transformer architecture and its application in BERT
Implement masked language modeling and next sentence prediction for language understanding
Fine-tune pre-trained BERT models for specific NLP tasks
Apply mathematical concepts to NLP tasks
Design and implement ML pipelines for NLP tasks
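
For the question-answering fine-tuning mentioned in the video, the output layer produces a start and an end position for the answer within the passage. One common way to turn per-token scores into a span is to pick the start/end pair with the highest combined score, subject to the end not coming before the start. The sketch below assumes that approach; the function name `best_answer_span`, the `max_answer_len` parameter, the toy passage, and the scores are all hypothetical and not taken from the video.

```python
def best_answer_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) token indices with the highest start + end score, where end >= start."""
    best = (0, 0)
    best_score = float("-inf")
    for i, start in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best

# Toy passage with made-up scores, as if produced by a fine-tuned output layer.
tokens = "bert was released by google in 2018".split()
start_scores = [0.1, 0.0, 0.2, 0.1, 2.5, 0.3, 0.4]
end_scores = [0.0, 0.1, 0.1, 0.2, 1.9, 0.2, 0.3]

i, j = best_answer_span(start_scores, end_scores)
print(tokens[i:j + 1])
```

The nested loop is quadratic in the passage length, which is acceptable for a short illustration; the length limit keeps the search from proposing implausibly long answers.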

💡 BERT's ability to learn language through masked language modeling and next sentence prediction makes it a powerful tool for NLP tasks, and its fine-tuning capabilities allow for adaptation to specific tasks and domains.
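
Next sentence prediction needs training pairs in which sentence B either really follows sentence A (label 1) or does not (label 0). The following sketch builds one such example from a list of consecutive sentences. The function name `make_nsp_example`, the toy sentences, and the 50/50 split are assumptions for illustration; the `[CLS]` and `[SEP]` marker tokens follow the convention of the BERT paper and are not discussed in the video.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one next-sentence-prediction example starting at sentences[index].

    Assumes at least three sentences and index < len(sentences) - 1.
    """
    sent_a = sentences[index]
    if rng.random() < 0.5:
        sent_b, label = sentences[index + 1], 1   # B really follows A
    else:
        # Pick any sentence that is neither A itself nor the true next sentence.
        candidates = [s for k, s in enumerate(sentences) if k not in (index, index + 1)]
        sent_b, label = rng.choice(candidates), 0
    a_tokens, b_tokens = sent_a.split(), sent_b.split()
    tokens = ["[CLS]"] + a_tokens + ["[SEP]"] + b_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and its [SEP]; segment 1 covers B and its [SEP].
    segment_ids = [0] * (len(a_tokens) + 2) + [1] * (len(b_tokens) + 1)
    return tokens, segment_ids, label

sentences = [
    "the weather was cold",
    "so she wore a coat",
    "transformers process words in parallel",
]
tokens, segment_ids, label = make_nsp_example(sentences, 0, random.Random(0))
print(tokens)
print(segment_ids)
print(label)
```

The segment ids produced here are what the segment embeddings would be looked up from, which is how the model can tell the two sentences apart when deciding whether B follows A.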
