Training BERT #2 - Train With Masked-Language Modeling (MLM)

James Briggs · Intermediate ·🧠 Large Language Models ·5y ago

Key Takeaways

The video demonstrates how to continue training a pre-trained BERT model (bert-base-uncased) with masked-language modeling (MLM) in PyTorch using Hugging Face's Transformers library, fine-tuning it on a custom text: Meditations by Marcus Aurelius.

Full Transcript

Okay, in this video we're going to take a look at how we would train a transformer model using masked-language modeling, or MLM. We would typically use MLM when we want to teach a transformer model like BERT to better understand the specific style of language in our specific use case. It consists of taking an input sentence or sequence, masking a few of the tokens within that sequence, and asking BERT to predict the words that we have masked. This is pretty useful because we can take any chunk of text, process it through a masking function, and use that for training. We don't need labeled data, which is really, really useful.

So let's jump straight into it. What we first need to do is import everything we need: our tokenizer and model from transformers, and PyTorch. So we do from transformers import BertTokenizer and BertForMaskedLM, and then we also import torch. Then we initialize our tokenizer and model. Our tokenizer is BertTokenizer.from_pretrained, and we're using the bert-base-uncased model. Just copy that, and our model will be pretty similar, but this time it's using BertForMaskedLM. MaskedLM is just the masked-language modeling, or MLM, that I mentioned before. So that's great.

Now, I'm going to be training this on a book that you can get from the internet: Meditations by Marcus Aurelius. The language in that is pretty unique, so I figured it's quite a good example. I already have it downloaded and I've cleaned it up a little bit, and I will include a link to that clean version so you can follow along if you want. For me, of course, I already have it downloaded here, meditations clean.txt, and we are reading that in. All I need to do here is read, and because I've separated each paragraph within Meditations with a newline character, I will just split by a newline, and that should get us what we want. Okay, so we have this text now, and what I want to do with this
is actually tokenize it, and this is just like we normally would with the transformers library. We have our tokenizer up here and we just pass our text into that. We're using PyTorch here, so we want to return PyTorch tensors, pt. We also need to set the maximum length, which for this BERT model is 512, and then we need to set truncation to True and padding equal to max_length, so this will either truncate or pad each one of these sequences to a length of 512 tokens. Ah, this should be return_tensors. So there we go, and here we are: we have our input_ids, we don't need to worry about token_type_ids here, and we have our attention_mask, which BERT just uses for calculating attention. I'm not going to go into depth on any of that.

Now, as I said before, we need two things for training our BERT model here. We need the input_ids, which will have mask tokens (we haven't created those mask tokens yet), and we also need our output labels, which will not include the mask tokens. So before we mask our input_ids, we need to create a copy of them, which we will use as our labels. We write inputs labels and set that equal to inputs input_ids, so our input_ids tensor, and we clone that by first detaching it and then cloning it. Okay, that's all we need. We'll have a look at inputs again, and now we have input_ids at the top, and if we go down to the bottom we have a copy of those in this labels tensor.

Now what we need to do is create our mask. When they pre-trained BERT they used a few rules, but the main rule at the core of it is that each token that is not a special token has a 15% chance of being masked. When I say special token, I mean the separator and classifier tokens, which look like this, and I'll point those out in a minute. In fact, we can have a look here: this is our classifier token, this 101, and you see that at the start of every sequence. And then at the end here we also have padding tokens. We also don't want to mask
those. So to create that 15% probability for each token, we use the torch.rand function, and we use this to create a tensor of floats that has the same dimensions as our inputs input_ids tensor, like so. If we check the shape of rand we see 507, which is the number of sequences we have, and 512, which is the number of tokens in each sequence. If we were to check input_ids, we see we get the same. Okay, now if we have a look in there, it's just a set of floats from zero up to one.

What we want to do is mask roughly 15% of these, or give each one a 15% probability of being masked, and the way we do that is to mask anything that is under the value 0.15. So, for example, these ones here will be masked, whereas these ones up here will not be masked. To do that, all we write is rand less than 0.15. Now if we have a look at the mask array, we'll see that the values that were less than 0.15 have this True value, which is what we'll be using to mask our tokens later on.

But at the same time, if you remember, the classifier token is always in the first position within each tensor, so here we would have a classifier token, here too, and in fact all of these would be padding tokens. We don't want to mask any of those, so to avoid that we add some extra logic. So put down brackets, but actually we're going to first test the logic so I can show you what it's actually doing. We have our inputs input_ids. Okay, so these are the padding tokens, these are the classifier tokens, and what we do is just say inputs input_ids not equal to 101, which is our classifier token. Now you see that we get a False wherever there is a classifier token. We want to do the same for our padding, and to do that we multiply, which is essentially adding it to the logic here; it's like an and statement. So now we are removing the padding tokens from that mask, and there's one more we can't
see here, but there's also a separator token, which is represented by the token ID 102, so we include that in here as well. Now, all of these together we want to add onto the logic up here. Okay, and now we get our mask array. You see now we have False wherever we had the padding tokens, we have False wherever we had the classifier tokens, but we still have a few masked tokens in there, so we have these True values here. Okay, so that's our mask array.

Now what we want to do is take the indices of each True value within each one of these rows of the tensor. Let's first do that with just one of them so you can see how it works. You take this one, check the shape, it should be 512. Yeah, so this is just one row here, and what we'll do is say nonzero, and this will return the indices where we have non-zero values, i.e. the True values. But this is like a vector, so what we want to do here is flatten that, so we do torch.flatten. Now we get almost a list, but it's still a tensor. We want an actual list, so we just write tolist. Okay, so now these are the index positions for the True values within this first row.

But we want to do this for every row, and to do that all we do is use a for loop. We initialize our selection list here and we say for i in mask array shape zero. The mask array shape zero, let me just show you, is the 507 rows that we have. We want to do selection.append, and then we already have our logic here, so we want to append this, but we're going to append it for every single one of those rows. Oh, sorry, let's add a range on here. And let's have a look at what we get in the selection; we'll just have a look at the first five. Ah, sorry, we need to replace that with i. There we go. So now we have indices for the first five rows here, and we have them for all of the rows, of course; we're just showing the first five. That's what we want, and then what we want to do is, we can
just copy this: we want to set the values at each one of these indices equal to 103, which is our mask token, within each row of our input_ids tensor. So we go inputs input_ids, then here we need to select those specific values, and that is row i followed by the selection of indices at i as well, and we set those equal to 103, like so. Now let's have a look at what we have in our input_ids tensor. We can see we have these mask tokens where we saw the True values before in our mask array, and we haven't touched any of the classifier tokens, the padding tokens, or the separator tokens, which are in there as well.

Now our tensors here are in the correct format, but we still need to process them through something called a DataLoader during training. To process them through a DataLoader, we need to convert them into a PyTorch Dataset object, and to do that we're going to create a class here which will handle this for us. It's going to be MeditationsDataset, and to create the dataset object we need to pass the Dataset class into here, so this is torch.utils.data.Dataset. There are a few things we need here. We need the initialization function, which is just __init__, and we pass self and encodings, and here we're just going to assign encodings to an attribute within this class: self.encodings is equal to encodings.

Now the DataLoader expects two additional methods: the __getitem__ method and the __len__ method. The __len__ method is so that it can check the length of the dataset that it's looking at, and __getitem__ is so that it can fetch each item as a dictionary. For __getitem__ we write this; we need self, and then we also specify the index, and what we do is return a dictionary. So we have the input_ids key, the labels key, and the attention_mask and token_type_ids keys, and it's going to pass those back to the DataLoader when it calls this __getitem__ method. So we write torch
tensor, and we pass the values at that index, for key, val in self.encodings.items(). So that should be okay, and the only thing left is the __len__ method. We define __len__, and here there are no input parameters other than self; all we need to do is return the length of our dataset. So we return len of self.encodings, and then we can use any of the tensors that we have in there, but we'll do input_ids. We could even modify this to use shape zero and get rid of the len, but I'll just stick with len for now.

So that is our class, which will handle the formatting of our data into a Dataset object. All we need to do is write dataset, so this will be our new dataset variable, equals MeditationsDataset, our class, and here we just pass our encodings, our inputs, like so. Okay, and now we can initialize our DataLoader, which PyTorch will be using to load our data during training. We write dataloader equals torch.utils.data.DataLoader, and here we pass our dataset, and then we also want to specify our batch size, so I'm going to go with 16.
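The dataset wrapper and DataLoader described up to this point can be sketched roughly as follows. This is a minimal illustration rather than the notebook's exact code: the random tensors stand in for the real tokenizer output, and the sizes (40 sequences of 8 tokens) are assumptions chosen to keep it small. Because the stored values are already tensors, the sketch indexes them directly instead of wrapping them in torch.tensor.

```python
import torch


class MeditationsDataset(torch.utils.data.Dataset):
    """Wraps a dict of equally sized tensors so a DataLoader can batch them."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # One example as a dict: {"input_ids": ..., "attention_mask": ..., "labels": ...}
        return {key: val[idx] for key, val in self.encodings.items()}

    def __len__(self):
        # Number of sequences (rows) in the dataset
        return len(self.encodings["input_ids"])


# Stand-in for the tokenizer output (assumed sizes, random token IDs).
input_ids = torch.randint(1000, 2000, (40, 8))
inputs = {
    "input_ids": input_ids,
    "attention_mask": torch.ones_like(input_ids),
    "labels": input_ids.clone(),
}

dataset = MeditationsDataset(inputs)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

# Each batch is a dict of stacked tensors with the same keys as `inputs`.
batch = next(iter(loader))
print(len(dataset), batch["input_ids"].shape)
```

The default collate function of DataLoader stacks the per-item dictionaries key by key, which is why the training loop can later access batch["input_ids"] directly.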
You can modify this depending on your GPU or your computer, however much memory you have. Then we also want to shuffle the data, so that we're not extracting, say, the first 16 paragraphs all at once; we're actually going to be extracting 16 from random parts of the book.

Okay, now we're ready to move on to actually training. First we'll set up all the training parameters. We want to move the model to the GPU if we have one, and we check whether we have a GPU using, let me show you first, torch.device cuda if torch.cuda.is_available else torch.device cpu. So we're saying here: if we have a CUDA-enabled GPU, use that; otherwise, just use the CPU. You can see here that I do have one, so we have this device type cuda. What we'll do is assign that to the device variable here, and we use that to move our model and everything else across to that device. We do that using model.to(device), and we get a big output here with all of this information that we don't need to look into.

Now we need to activate our model's training mode, so we just do model.train(). The final thing before we set up our actual training loop is to initialize our optimizer. We're going to be using Adam with weight decay here. Weight decay just reduces the chance of overfitting, especially with big models like transformer models. So we do from transformers import AdamW, and our optimizer is going to be AdamW; we pass in our model parameters, and we also need to pass in a learning rate, and we'll do 1e-5. So model.parameters, with brackets at the end there. Okay.

Now we're fully set up, so we can actually begin training, which is set up as a normal training loop in PyTorch. The first thing I want to do is import tqdm. This allows us to create a progress bar during training; otherwise we just sit there and we don't
see any updates on training, which we don't want, obviously. I'm going to say we do two epochs; you can obviously modify this as you want. We're just seeing how this all works, so I'm not going to train it that much, and you want to be careful about training transformer models for too many epochs, as they overfit very easily. So we do for epoch in range(epochs), and then here we want to set up our training loop. To do that, we wrap it within a tqdm function, and we just pass our DataLoader, which I called dataloader up here, and then leave equals True, which just leaves the progress bar in place rather than replacing it with every new epoch. Then we run through each batch within our loop, so that's batches of 16 items at a time.

We first want to initialize our calculated gradients. With every loop we will calculate the gradients, and we don't want to start with gradients already calculated; we want to set them to zero, so we do optim.zero_grad(). Then we want to pull out all of the tensors that we require for training. So input_ids, of course, is the first one, and that will be equal to batch, and in here we access our input_ids. Additionally, you saw before that we moved our model to our GPU; we also want to do that for our tensors here, so we say to(device). Okay, and we follow this structure for our other tensors as well. For masked-language modeling we don't need to do anything with token_type_ids, so we just ignore those. We have our attention_mask, which we do need, and we also have our labels, which we do need, of course.

With that we can process everything. So now we do outputs equals model, and we pass our input_ids, we specify the attention_mask, so let's copy that, and we also need to specify our labels, which is labels. Okay, now let's extract the loss from those outputs, so we get a loss tensor there, and what we do here is use the backward method, which calculates the gradient of the loss for every parameter in our model
and from that we can calculate the gradient update using our optimizer. So we have optim.step(), and this will take a step to optimize all the weights within our model based on the loss. Now, the final little bit here is just aesthetics: I want to actually see certain bits of information in that loop. So all I do is loop.set_description, and here I just want to show the epoch, and then I also want to see the loss in the postfix, so we do loop.set_postfix and pass loss equals loss.item(). The item here just pulls out the exact value within that loss tensor up here. Okay, that should be everything. Let's see what we have. There we go, so now we're training, and we can see the loss is going down slowly.

And that's it. We're now training our transformer model on Meditations by Marcus Aurelius with masked-language modeling. It's really not that hard. I mean, there is quite a bit to it, but I think once you do it, it's reasonably straightforward, and the fact that you can do this on basically any set of text using just a masking function is incredibly useful. We don't need to go out looking for labeled data anywhere, which is amazing. So that's it for this video. I hope it's been useful. I know it's a bit of a long one, but thank you very much for watching, and I will see you again in the next one.
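The training loop walked through in the transcript might look roughly like the sketch below. Several things here are assumptions made so that it runs quickly without downloading anything: the model is a tiny, randomly initialised BertForMaskedLM built from a BertConfig (the video loads bert-base-uncased with from_pretrained), the data is random token IDs wrapped in a TensorDataset rather than the MeditationsDataset class, and the optimizer is torch.optim.AdamW instead of the AdamW imported from transformers in the video, since that class was deprecated in later releases of the library.

```python
import torch
from tqdm import tqdm
from transformers import BertConfig, BertForMaskedLM

# Assumption: a tiny random BERT so the sketch runs in seconds on CPU.
config = BertConfig(
    vocab_size=200,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=16,
)
model = BertForMaskedLM(config)

# Dummy data standing in for the masked Meditations tensors.
labels = torch.randint(104, 200, (32, 16))
input_ids = labels.clone()
input_ids[:, 3] = 103  # pretend position 3 was masked in every row
attention_mask = torch.ones_like(input_ids)
dataset = torch.utils.data.TensorDataset(input_ids, attention_mask, labels)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

# Use a GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
model.train()  # training mode (enables dropout)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

epochs = 2
losses = []
for epoch in range(epochs):
    loop = tqdm(loader, leave=True)
    for ids, mask, lbls in loop:
        optim.zero_grad()  # clear gradients left over from the previous step
        outputs = model(
            ids.to(device),
            attention_mask=mask.to(device),
            labels=lbls.to(device),
        )
        loss = outputs.loss
        loss.backward()  # gradients of the loss w.r.t. every parameter
        optim.step()     # update the weights using those gradients
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())
        losses.append(loss.item())
```

With the real model and data, the only changes should be the model construction and the way each batch is unpacked (by dictionary key instead of by position).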

Original Description

🎁 Free NLP for Semantic Search Course: https://www.pinecone.io/learn/nlp

BERT has enjoyed unparalleled success in NLP thanks to two unique training approaches, masked-language modeling (MLM), and next sentence prediction (NSP). In many cases, we might be able to take the pre-trained BERT model out-of-the-box and apply it successfully to our own language tasks. But often, we might need to pre-train the model for a specific use case even further. Further training with MLM allows us to tune BERT to better understand the particular use of language in a more specific domain.

Out-of-the-box BERT - great for general purpose use.
Fine-tuned with MLM BERT - great for domain-specific use.

In this video, we'll cover exactly how to fine-tune BERT models using MLM in PyTorch.

👾 Code: https://github.com/jamescalam/transformers/blob/main/course/training/03_mlm_training.ipynb
Meditations data: https://github.com/jamescalam/transformers/blob/main/data/text/meditations/clean.txt
Understanding MLM: https://youtu.be/q9NS5WpfkrU
🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
📙 Medium article: https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c
🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership
📖 If membership is too expensive - here's a free link: https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c?sk=17a19eca8dc8280bea4138802580ffe0
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

This video teaches how to train BERT with masked-language modeling using PyTorch and Hugging Face's Transformers library, covering tokenization, masking, and the training loop for fine-tuning a pre-trained model on a custom dataset. The practical steps and code snippets provided enable viewers to implement the techniques in their own NLP projects.

Key Takeaways

Import the necessary libraries (transformers, PyTorch)
Initialize the tokenizer and model (bert-base-uncased)
Tokenize the text with the Hugging Face Transformers tokenizer, padding or truncating each sequence to 512 tokens
Create the input IDs and attention mask for BERT, and clone the input IDs into a labels tensor before masking
Mask input IDs with a probability of 15% for non-special tokens
Create a custom Dataset class to handle data formatting
Use a DataLoader to batch and shuffle the data during training
Initialize the AdamW optimizer (Adam with weight decay) with a learning rate of 1e-5
Train the model for 2 epochs with a tqdm progress bar
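The masking step in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's exact code: the input IDs are synthetic (a classifier token, 20 random word tokens, a separator token, then padding), the padding ID 0 is the bert-base-uncased value and is not stated in the transcript, and the mask is applied with a single boolean-indexed assignment instead of the per-row loop over index lists used in the video. The `&` operator plays the same role as the multiplication described in the transcript.

```python
import torch

torch.manual_seed(0)

# Special token IDs for bert-base-uncased (PAD = 0 is an assumption
# not spelled out in the transcript).
CLS, SEP, PAD, MASK = 101, 102, 0, 103

# Synthetic stand-in for tokenizer output: [CLS] 20 tokens [SEP] [PAD]...
input_ids = torch.zeros((4, 32), dtype=torch.long)
input_ids[:, 0] = CLS
input_ids[:, 1:21] = torch.randint(1000, 2000, (4, 20))
input_ids[:, 21] = SEP

# Keep an unmasked copy to use as the training labels.
labels = input_ids.detach().clone()

# One random float in [0, 1) per token; values under 0.15 are candidates.
rand = torch.rand(input_ids.shape)

# True where a token should be masked: under 0.15 and not a special token.
mask_arr = (
    (rand < 0.15)
    & (input_ids != CLS)
    & (input_ids != SEP)
    & (input_ids != PAD)
)

# Overwrite the selected positions with the mask token ID.
input_ids[mask_arr] = MASK

print(int(mask_arr.sum()), "tokens masked out of", int((labels > 102).sum()))
```

Because each token is sampled independently, the fraction actually masked varies from run to run and only approaches 15% on larger inputs.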

💡 Masked-language modeling is a key component of BERT's pre-training process, allowing the model to learn contextual relationships between tokens and improve its performance on downstream NLP tasks.
