Training BERT #5 - Training With BertForPretraining

James Briggs · Advanced ·🧠 Large Language Models ·5y ago


Key Takeaways

The video demonstrates training BERT with MLM and NSP for pre-training, fine-tuning BERT for specific use cases, and utilizing tools like Hugging Face Transformers, PyTorch, and torch.utils.data.DataLoader.

Full Transcript

Hi, welcome to the video. Here we're going to have a look at how we can pre-train BERT. What I mean by pre-train is fine-tune BERT using the same approaches that are used to actually pre-train BERT itself. We would use these when we want to teach BERT to better understand the style of language in our specific use cases.

We'll jump straight into it, but what we're going to see is essentially two different methods applied together. When we're pre-training, we're using something called masked language modeling, or MLM, and also next sentence prediction, or NSP. In a few previous videos I've covered all of these, so if you do want to go into a little more depth, I would definitely recommend having a look at those. In this video we're just going to go straight into actually training a BERT model using both of those methods, using the pre-training class.

First we need to import everything that we need. I'm going to import requests, because I'm going to use requests to download the data we're using, which is from here; you'll find a link in the description for that. We also need to import our tokenizer and model classes from transformers, so from transformers we're going to import BertTokenizer and also BertForPreTraining. Like I said before, this BertForPreTraining class contains both an MLM head and an NSP head. We also need to import torch as well, so let me import torch.

Once we have that, we can initialize our tokenizer and model. We initialize our tokenizer like this, so BertTokenizer.from_pretrained, and we're going to be using the bert-base-uncased model; obviously you can use whichever model you like. For our model we have the BertForPreTraining class. So that's our tokenizer and model.

Now let's get our data. We don't need to worry about that warning; it's just telling us that we need to train the model if we want to use it for inference predictions. To get our data, we're going to pull it from here, so let me copy that and
it's just requests.get, and we paste that in there. We should see a 200 code, which is good. Then we just extract the data using the text attribute, so text equals that. We also need to split it, because it's a set of paragraphs that are split by a newline character, and we can see those in here.

Now we need to prepare our data both for NSP and MLM. We'll go with NSP first, and to do that we need to create a set of random sentences, so sentence A and B, where sentence B is not related to sentence A. We need roughly 50% of those, and for the other 50% we want sentence A to be followed by the genuine sentence B, so they are more coherent. We're basically teaching BERT to distinguish between coherence and non-coherence between sentences, so, like, long-term dependencies.

We just want to be aware that within our text we have this one paragraph that has multiple sentences, so if we split it like this, we have those. So we need to create essentially a list of all of the different sentences that we have, which we can just pull from when we're creating our training data for NSP. To do that we're going to use this comprehension here. We write sentence for each paragraph in the text, so this variable, and then for sentence in para.split; this is where we're getting our sentence variable from. And if we have a look at this one, we see we get this empty sentence; we get that for all of our paragraphs, so we just want to not include those. So we say if sentence is not equal to that empty sentence. We're also going to need to get the length of that bag for later as well.

Now what we do is create our NSP training data. We want that 50-50 split, so we're going to use the random library to create that 50-50 randomness. We want to initialize a list of sentence A's, a list of sentence B's, and also a list of labels. Then we're going to loop through each paragraph in our text, so for paragraph in text. We want to
extract each sentence from the paragraph, so we're going to use something similar to what we've done here. We write sentences, and this is going to be a list of all the sentences within each paragraph, so sentence for sentence in paragraph.split by a period character. We also want to make sure we're not including those empty ones, so if sentence is not equal to empty.

Once we're there, we want to get the number of sentences within each sentences variable, so we just get the length. The reason we do that is because we want to check it a couple of times in the next few lines of code. The first time we check it is now: we check that the number of sentences is greater than one. This is because we're concatenating two sentences to create our training data, so we don't want to get just one sentence. We need a paragraph where, for example in this one, we have multiple sentences, so that we can select this sentence followed by this sentence. We can't do that with these, because there's no guarantee that this paragraph here is going to be talking about the same topic as this paragraph here, so we just avoid that.

In here, the first thing we want to do is set our start sentence, which is where sentence A is going to come from. We're going to select it randomly. Say for this example, we want to randomly select any of the first one, two, three sentences. We'd want to select any of these three but not this last one, because if this is sentence A, we don't have a sentence B which follows it to extract. So we write random.randint, from zero up to num_sentences minus two.

We can now get our sentence A, which is append, and we just write sentences[start]. Then for our sentence B, 50% of the time we want to select a random one from the bag up here, and 50% of the time we want to select the genuine next sentence. So we say if random.random, which is a random float between zero and one, is greater than 0.5, then sentence B is going to be, we'll make this our coherent version, so
sentences[start + 1], and that means that our label will have to be zero, because these two sentences are coherent: sentence B does follow sentence A. Otherwise we select a random sentence for sentence B, so we do append, and here we write bag, and we need to select a random one. So we do random, the same as we did earlier on for the start: random.randint from zero to the bag size minus one. We also need to do the label, which is going to be one in this case. We can execute that now, and that will work. I go into a little more depth on this in the previous NSP video, so I'll leave a link to that in the description if you want to go through it.

Now what we can do is tokenize our data. To do that we just write inputs and we use the tokenizer, so this is just normal Hugging Face transformers, and we just pass sentence A and sentence B. Hugging Face transformers will know what we want to do with that and will deal with the formatting for us, which is pretty useful. We want to return PyTorch tensors, so return_tensors equals pt, and we need to set everything to a max length of 512 tokens, so max_length equals 512.
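Stepping back to the NSP pair construction described above, the steps can be sketched roughly as follows. This is a minimal sketch, not the code shown on screen: the function name build_nsp_pairs, the seed argument, and the whitespace stripping are assumptions added for illustration, while the bag of sentences, the random.randint start index, the 0.5 threshold, and the 0/1 labels follow the transcript.

```python
import random


def build_nsp_pairs(paragraphs, seed=None):
    """Build sentence A / sentence B pairs with NSP labels (0 = B follows A, 1 = random B)."""
    rng = random.Random(seed)  # assumption: a seeded generator, only for reproducibility

    # "Bag" of every non-empty sentence in the corpus, used to draw random sentence Bs.
    bag = [s.strip() for para in paragraphs for s in para.split('.') if s.strip() != '']
    bag_size = len(bag)

    sentence_a, sentence_b, label = [], [], []
    for paragraph in paragraphs:
        sentences = [s.strip() for s in paragraph.split('.') if s.strip() != '']
        num_sentences = len(sentences)
        # A genuine "next sentence" only exists in paragraphs with at least two sentences.
        if num_sentences > 1:
            # Any sentence except the last one can be sentence A.
            start = rng.randint(0, num_sentences - 2)
            sentence_a.append(sentences[start])
            if rng.random() > 0.5:
                # Coherent pair: the genuine next sentence, label 0.
                sentence_b.append(sentences[start + 1])
                label.append(0)
            else:
                # Incoherent pair: a random sentence from the bag, label 1.
                sentence_b.append(bag[rng.randint(0, bag_size - 1)])
                label.append(1)
    return sentence_a, sentence_b, label


paragraphs = ["One. Two. Three.", "Solo sentence.", "Alpha. Beta."]
sentence_a, sentence_b, label = build_nsp_pairs(paragraphs, seed=0)
```

Note that a random draw from the bag can occasionally pick the true next sentence; the approach described in the video accepts that small amount of label noise.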
Truncation needs to be set to True, and we also need to set padding equal to max_length. That creates three different tensors for us: input_ids, token_type_ids, and attention_mask.

For the pre-training model we need two more tensors. First we need our next sentence label tensor, so to create that we write inputs next_sentence_label, and that needs to be a LongTensor containing our labels, which we created before, in the correct dimensionality. That's why we're using the list here and the transpose. We can look at what that creates as well, so if we look at the first 10, we get that.

Now what we want to do is create our masked data. We need the labels for our mask first, so what we'll do is clone the input_ids tensor and use that clone for the labels tensor. Then we're going to go back to our input_ids and mask around 15% of the tokens in that tensor. So let's create that labels tensor, which is going to be equal to inputs input_ids, detach and clone. Now we'll see in here we have all of the tensors we need, but we still need to mask around 15% of these before moving on to training our model.

To do that, we'll create a random array using torch.rand, which needs to be in the same shape as our input_ids. That will just create a big tensor of values between zero and one, and what we want to do is mask around 15% of those, so we write something like this. That will give us our mask here. But we also don't want to mask special tokens, which we are doing here: we're masking our classification tokens, and we're also masking padding tokens up here. So we need to add a little bit more logic to that. Let me just add this to a variable. We add that logic, which says: and input_ids is not equal to 101, which is our CLS token, which is what we get down here. We can actually see the impact: see, we get False now. We also want to do the same for separator tokens, which is 102.
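The masking procedure described around this point (clone input_ids as labels, draw random values, exclude special tokens, then set the selected positions to the mask token) can be sketched as below. This is a hedged sketch rather than the on-screen code: the function name mask_tokens, the mask_prob and generator parameters, and the toy tensor are assumptions for illustration, and the selection is kept as a tensor instead of being converted to a list. The token IDs 101, 102, 0, and 103 are the ones quoted in the video for bert-base-uncased.

```python
import torch

# Special token IDs quoted in the transcript for bert-base-uncased.
CLS_ID, SEP_ID, PAD_ID, MASK_ID = 101, 102, 0, 103


def mask_tokens(input_ids, mask_prob=0.15, generator=None):
    """Return (masked_input_ids, labels); labels are an untouched copy of input_ids."""
    labels = input_ids.detach().clone()

    # Random floats in [0, 1) with the same shape as input_ids.
    rand = torch.rand(input_ids.shape, generator=generator)

    # True where a token is selected for masking and is not CLS, SEP, or padding.
    mask_arr = (
        (rand < mask_prob)
        & (input_ids != CLS_ID)
        & (input_ids != SEP_ID)
        & (input_ids != PAD_ID)
    )

    masked = input_ids.detach().clone()
    for i in range(masked.shape[0]):
        # Indices of the True positions in row i.
        selection = torch.flatten(mask_arr[i].nonzero())
        masked[i, selection] = MASK_ID
    return masked, labels


# Toy example: one sequence with CLS, two SEP tokens, and padding.
toy_ids = torch.tensor([[101, 2023, 2003, 1037, 102, 2307, 102, 0, 0, 0]])
masked_ids, labels = mask_tokens(toy_ids)
```

With the default mask_prob, roughly 15% of the non-special tokens are replaced per call, so short toy sequences may come back unchanged.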
We can't see any of those here. And then our padding tokens, which is zero, so you see these will go False now, like so. So that's our masking array.

Now what we want to do is loop through all of these, extract the points at which they are not False, so where we have the mask, and use those index values to mask our actual input_ids up here. To do that we go for i in range inputs input_ids.shape[0]; this is like iterating through each row. What we do here is we get selection, so these are the indices where we have True values from the mask array, and we do that using torch.flatten on the mask array at the given index, where they are nonzero, and we want to create a list from that. Let me show you what the selection looks like quickly: it's just a selection of indices to mask. We want to apply that to our inputs input_ids, so at the current index we select those specific items and set them equal to 103, which is the masking token ID. So that's our masking.

Now we need to take all of our data here and load it into a PyTorch DataLoader, and to do that we need to reformat our data into a PyTorch Dataset object, which we do here. The main thing to know is that we pass our data into this initialization, which assigns it to the self.encodings attribute. Then here we say, given a certain index, we want to extract the tensors in a dictionary format for that index. And here we're just passing the length, so how many samples we have in the full dataset. So we run that and initialize our dataset using that class: we write dataset equals MeditationsDataset and pass our data in there, which is inputs. With that we can create our data loader like this, so torch.utils.data.DataLoader, and we pass dataset. So that's ready.

Now we need to set up our training loop. The first thing we need to do is check if we are on GPU or not; if we are, we use it. We do that like so, so device equals torch
device cuda if torch.cuda.is_available, else torch.device cpu. That's saying use the GPU if we have a CUDA-enabled GPU, otherwise use the CPU. Then we want to move our model over to that device, and we also want to activate the training mode of our model.

Then we need to initialize our optimizer. I'm going to be using Adam with weight decay, so from transformers import AdamW, and we initialize it like this: optim equals AdamW. We pass our model parameters to that, and we also pass a learning rate, which is going to be 5e-5.

Now we can create our training loop. We can use tqdm to create the progress bar, and we're going to go through two epochs, so for epoch in range 2. We initialize our loop by wrapping it within tqdm; in here we have our data loader, and we set leave equal to True so that we can see that progress bar. Then we loop through each batch within that loop. Oh, up here I didn't actually set the batches, my mistake. Up here where we initialize the data loader, I'm going to set batch size equal to 16 and also [shuffle] the dataset as well.

So for batch in loop: here we want to initialize the gradients on our optimizer, and then we need to load in each of our tensors, and there are quite a few of them. If we look at inputs.keys, we need to load in each one of these. So input_ids equals batch, and we access this like a dictionary, so input_ids. We also want to move each one of those tensors that we're using to our device. We do that for each one of those, so we have attention_mask and next_sentence_label and also labels.

Now we can actually process that through our model. In here we just need to pass all of these tensors that we have, so input_ids, then token_type_ids, just copy this, attention_mask, next_sentence_label, and labels. So it's quite a lot going into our model. Now what we want to do is extract the loss from that. Then we calculate the loss for every parameter in
our model, and then using that we can update our parameters using our optimizer. Then what we want to do is print the relevant info to our progress bar that we set up using tqdm and loop. So loop.set_description, and here I'm just going to put the epoch info, so the epoch we're currently on. I also want to set the postfix, which will contain the loss information, so loss.item. We can run that, and you see that our model is now training.

So we're now training a model using both masked language modeling and next sentence prediction, and we haven't needed to take any structured data. We've just taken a book, pulled out all the data, and formatted it in the correct way for us to actually train a BERT model, which I think is really cool. So that's it for this video. I hope it's been useful, and I'll see you in the next one.
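The dataset wrapper and training loop walked through above can be sketched end to end as follows. Several things here are assumptions made so the sketch runs without downloading anything: the class name PretrainingDataset, the train helper, the randomly initialized tiny BertConfig (the video uses bert-base-uncased via from_pretrained), the random toy tensors standing in for the tokenized and masked inputs, and a batch size of 4 instead of 16. The sketch also swaps in torch.optim.AdamW for the AdamW import from transformers mentioned in the video, and leaves out the tqdm progress bar.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertConfig, BertForPreTraining


class PretrainingDataset(Dataset):
    """Wraps a dict of equally sized tensors and serves one sample per index."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Row idx of every tensor, in dictionary format.
        return {key: tensor[idx] for key, tensor in self.encodings.items()}

    def __len__(self):
        return self.encodings["input_ids"].shape[0]


def train(model, loader, device, epochs=2, lr=5e-5):
    """Run the MLM + NSP training loop and return the per-batch losses."""
    model.to(device)
    model.train()
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []
    for epoch in range(epochs):
        for batch in loader:
            optim.zero_grad()  # reset gradients from the previous step
            outputs = model(
                input_ids=batch["input_ids"].to(device),
                token_type_ids=batch["token_type_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
                next_sentence_label=batch["next_sentence_label"].to(device),
                labels=batch["labels"].to(device),
            )
            loss = outputs.loss  # combined MLM and NSP loss
            loss.backward()      # gradients for every parameter
            optim.step()         # parameter update
            losses.append(loss.item())
    return losses


# Toy stand-in for the tokenized, masked inputs (random IDs, no real text).
torch.manual_seed(0)
num_samples, seq_len, vocab_size = 8, 16, 200
input_ids = torch.randint(104, vocab_size, (num_samples, seq_len))
toy_inputs = {
    "input_ids": input_ids,
    "token_type_ids": torch.zeros_like(input_ids),
    "attention_mask": torch.ones_like(input_ids),
    "next_sentence_label": torch.randint(0, 2, (num_samples, 1)),
    "labels": input_ids.clone(),
}

# Tiny, randomly initialized model so the sketch needs no pretrained weights.
config = BertConfig(
    vocab_size=vocab_size, hidden_size=32, num_hidden_layers=1,
    num_attention_heads=2, intermediate_size=64, max_position_embeddings=seq_len,
)
model = BertForPreTraining(config)
loader = DataLoader(PretrainingDataset(toy_inputs), batch_size=4, shuffle=True)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
losses = train(model, loader, device)
```

To follow the video more closely, the tiny model would be replaced with BertForPreTraining.from_pretrained on bert-base-uncased and the toy dictionary with the real tokenizer output plus the next_sentence_label and labels tensors.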

Original Description

NSP Logic https://youtu.be/1gN1snKBLP0
MLM Logic https://youtu.be/q9NS5WpfkrU
🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
📙 Medium article: https://towardsdatascience.com/how-to-train-bert-aaad00533168
📖 Here's a free link: https://towardsdatascience.com/how-to-train-bert-aaad00533168?sk=5ad4e5e44a6c573b3be1967c9abdcc35
👾 Discord https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff



This video shows how to train BERT with MLM and NSP together, using the same objectives that were used to pre-train BERT, in order to fine-tune it for a specific use case with Hugging Face Transformers and PyTorch. The skills learned can be applied to help BERT better understand the style of language in a particular domain.

Key Takeaways

Import necessary libraries and classes
Initialize tokenizer and model
Download data using requests
Load data into tokenizer and model
Create a set of random sentences for NSP training data
Create a 50-50 split for NSP training data
Randomly select sentence A and B from a paragraph
Tokenize data using Hugging Face Transformers
Create input IDs, token type IDs, and attention mask tensors
Create a next sentence label tensor
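
The last takeaway, creating the next sentence label tensor, can be sketched as below. This is a small illustration under assumptions: the label values are made up, and the variable names follow the transcript's description of wrapping the label list in another list and transposing it so that each sample gets its own row.

```python
import torch

# Hypothetical NSP labels for four sentence pairs (0 = B follows A, 1 = random B).
label = [0, 1, 1, 0]

# Wrapping the list gives shape (1, 4); the transpose turns it into (4, 1),
# one label per sample, matching the first dimension of input_ids.
next_sentence_label = torch.LongTensor([label]).T
```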

💡 The video highlights the importance of fine-tuning BERT for specific use cases and demonstrates how to utilize various tools to improve language understanding.
