Building MLM Training Input Pipeline - Transformers From Scratch #3

James Briggs · Advanced ·🧠 Large Language Models ·4y ago

Key Takeaways

The video demonstrates building a masked-language modeling (MLM) training input pipeline in PyTorch: tokenizing Italian text with the RoBERTa-style tokenizer trained earlier in the series, masking roughly 15% of the tokens, and wrapping the resulting tensors in a dataset object and DataLoader ready for training.

Full Transcript

Hi, welcome to this video. This is the third video in our Transformers From Scratch miniseries. In the last two videos we got a load of data and trained our tokenizer, which is what you can see here. I'm going to have to re-run these. Okay, and we just tokenize some Italian; this is an Italian BERT model, well, a RoBERTa model. So we can write something like this, which means "hello, how are you", and it will tokenize our text.

This is where we've got up to. What we now want to do is build an input pipeline and train a model. This is reasonably involved, I would say, because we need to do a few things. First, we need three different tensors for our model: the input IDs, the attention mask, and the labels tensor. The labels tensor is just going to be our input IDs as they are at the moment. Our input IDs, though, need to be passed through a masked language modeling (MLM) script, which will mask around 15% of the tokens within that tensor. Then, while we're training, our model is essentially going to try to guess what those masked tokens are, and we'll optimize the model using the loss between the guesses that the model outputs from the input IDs and the real values, which are our labels. That's essentially how it's going to work. It's a lot easier said than done.

So the first thing we're going to do is create our masked language modeling function. If you have watched some of my videos before, we quite recently did, I think, two videos on masked language modeling. I'll leave a link to those in the description, because the code is pretty much the same as what we cover there. I will cover it very quickly here, but not too in depth, so if you're interested, those links will be in the description.

The very first thing we need to do, which I haven't done yet, is import torch; we're using PyTorch here. We need to create a random array, so we write torch.rand, and this random array needs to be the same shape as the tensor that we've input up here, so let's write tensor.shape. Then we want to mask around 15% of those. To do that we can use rand < 0.15, because what we've created here is an array in the shape of our input tensor, which is going to be our input IDs, where every value is in the range of zero to one, and that's completely random. That means for each token there's roughly a 15% chance of that value being under 0.15. So that's our first criterion for our random masking array, the roughly 15%. You can call this the mask array as well.

But there are a few other criteria as well, and again, I covered that in the other videos. In short, we don't want to mask special tokens. We can see up here we have two special tokens. If we add padding, and I'm just going to add a little bit of padding, not loads, so let's go max_length=10 and write padding='max_length', we get these extra ones here, which are padding tokens. So we basically say we don't want to mask our zeros, twos, or ones, because they're special tokens that we don't want to mask. So we also put where tensor is not equal to zero, and, let's just copy that, it's a little bit easier, also not equal to one, and also not equal to two.

I do wonder if we could make that a bit nicer. You can either do this, where you specify each token, and you will want to do that sometimes, maybe because your special tokens are in the range of, like, 100, 0, 101, so there are a few different ones. But because everything we've got is 2 or below, we could just write this: where tensor is greater than 2. So this is like an AND statement saying we're going to mask tokens that have a randomly generated value of less than 0.15, that's our 15% criterion, and that are not a special token, i.e. they are greater than the value 2, because our special tokens are 0, 1, and 2.

So that's cool. Now what we want to do is loop through each row in our tensor, so we want to do for i in range(tensor.shape[0]); this is how many rows we have in our tensor. We can't do this in parallel, because each row is going to have a different number of tokens that will be masked, so if we did this in parallel we'd end up trying to fit different-size rows into an equally sized tensor. We can't do that. Again, if this is confusing, I have those videos, but you don't need to specifically know everything that's going on here; this is just how we mask roughly 15% of tokens.

So we want torch.flatten, and this is a bit confusing, but we want to take the mask array at the current position and say where it's not zero. When we create this mask array we essentially get a load of true or false values in the shape of our tensor, where the ones mark a mask. What we're doing here is saying, get me the positions of all the values that are not zero, i.e. they're ones. That gives us a list within a list, so we get something like this, and it will say indices 2, 5, 18; they are where your mask tokens will be. Then we use torch.flatten here to remove that outer list, and at the end we convert it to a list so that we can do some fancy indexing in a moment.

That fancy indexing looks like this: we have our tensor, we specify the current row, because we're going one row at a time, and then we specify that selection of indices, which is where we're going to place our mask. Now, what does the mask token look like? We can find it over here in our vocab.json. Scroll to the top and we see our mappings; the mask token is number four, so that's what we're going to use. Switch back over, and we make those values equal to 4. That's our mask. At that point we have successfully masked our input IDs, and we want to return the tensor. So that's our masking function. That's a big part of this video, and one of the harder parts.

Now I'm going to scroll up a little bit to here and take this, which will give us a list of all of our training files. We just need to do from pathlib import Path. Okay, let's have a look at what we have. This is just a list of everything that we have over here. These are text files containing our Italian samples; each sample is separated by a newline character, and each file contains around 10,000 samples, so we have quite a bit of data.

What we're going to do here is create the three tensors that I mentioned before: the labels, the input IDs, and the attention mask. Let's first initialize them as lists, so input_ids, attention mask, which I'm calling mask, and labels. We're also going to use a progress bar here, so I'm going to import that as well: from tqdm.auto import tqdm. Then I'm going to loop through each path in our paths, wrapped in tqdm, which creates our progress bar. For each path we're going to load it, extract all the data, convert it into the format that we need, append each of those to these lists, and then create a big tensor out of that.

So we want to write with open, and then here we have our path, we're reading, and the encoding is utf-8, as f. We want to write text = f.read().split('\n'), like that; actually, I'm going to call it lines. This is just a big list of 10,000 samples that are all Italian. Then we want to encode that, so we write sample = tokenizer(lines), with our max_length, which is going to be 512. We want padding up to that max length, and we also want to truncate anything that is longer than that, so truncation=True. That's our tokenization done.

Then we want to extract all of those and add them to our lists. We get our labels first. The labels are just the input IDs produced by our sample, so sample.input_ids. I'm thinking here we can do return_tensors='pt', to use PyTorch. So we append our input IDs to labels, and then we have our mask, where we append the sample's attention_mask. We can also see that up here, by the way; this is what we're doing, taking those out and putting them into our lists.

So we have labels and masks, and now we're going to create our input IDs. That's what we built this masked language modeling function for, and into it we need to pass our tensor, so we write sample.input_ids, and before I forget, that needs to go within mlm, like that. Now, I don't want to modify that tensor, because it's been appended to labels, so I'm going to create a clone of it, and that's done using detach and clone, like that.

So that's pretty good; let's run that. Okay, it's going to take a long time, so I'm not going to use all of them. It was going up as well, so I have no idea how long it would take. Let's go with the first 50 for now. We've still got to wait a little while, but at least not as long. I'll leave that to run, hopefully it shouldn't take too long, and I'll see you when it's done.

Okay, so that's done; it wasn't too long. If we have a look, input_ids at the moment is just a big list. I don't know if printing it is a good idea, but here we go: we just have a list of tensors. Rather than having a list of tensors, we can use something called torch.cat. torch.cat expects a list of tensors to be passed to it, which is why I've done this, where we have lists and just append tensors to them. It will concatenate our tensors, which is pretty cool. So what we want to do now is write input_ids and just concatenate all of our tensors, so then they're ready for formatting into a dataset. We have mask here and labels here.

It's also worth pointing out that we have that mask token there, so we know that we have mask tokens in our input IDs. Let's run that and compare. Let's go input_ids[0]; that's quite a lot, so let's take just the first ten. Then let's do the same for labels, and we'll see that we don't have these fours, or hopefully we shouldn't have those fours. Okay, so that's essentially the masking operation: this is covered with a mask here, and the same here, here, and here.

Now, the format that our dataset and our model need is a dictionary where we have input_ids, which maps to input_ids obviously, and you can guess the other two as well: attention_mask maps to mask, and the final one is labels. So they're our encodings.

Now we create a dataset object. In fact, we create a dataset object in order to create a data loader object, which is what we use to load data into our model, and that's essentially our input pipeline. But to create that data loader we need a dataset object. We create that like this: we write class Dataset, call it whatever you want, and it inherits from torch.utils.data.Dataset. We need an initialization function, which is going to store our encodings internally, and don't forget the def there. So we write self.encodings = encodings. This initializes our dataset object.

There are two other methods that this object needs. We need a length method, so that we can call len(dataset) and it will return the number of samples in the dataset. We also need a get-item method, which will allow the data loader to extract a certain sample. Say it asks for number one; it's going to go into this dataset object and extract the tensors, the input IDs, attention mask, and labels, at position one.

We'll do length first. We don't need to pass anything in there; we're just calling it __len__. From that we want to return self.encodings['input_ids'], and remember that before we took the shape, and the first value was the length. If I take input_ids here and copy that, we get that 500K, which is the number of samples we have. That's what we want to return. So that's our __len__.

Then we also have __getitem__. Here we do want to pass an index value, which is the data loader requesting a certain position, and for that we're going to return a dictionary. It needs to be in this format here, but we need to specify the correct index. What we could do is use self.encodings, access our input_ids like that, and then index position i. I also need to change that here, otherwise it will give us an error: .shape. That's fine, you can do that if you want. But there's an easier way, where we don't care about the structure of the dataset and just want to get it out: we write key: tensor[i], so the specific index of that tensor, for key, tensor in self.encodings.items(). If we were to call encodings.items(), we can see we get essentially everything in our dataset. So we're just looping through that, returning it, and specifying which index we're returning.

Once we have written that, we can initialize our dataset, so we write dataset = Dataset and pass in our encodings. That's our dataset. Now we initialize our data loader, and this is pretty much it for our input pipeline. So, DataLoader, which comes from torch.utils.data, the same place as our Dataset. We pass in our dataset object and specify a batch size; I typically go with 16, but this will depend on how much your compute can handle at once, so play around with it and see what works. We also want to shuffle our dataset.

So that's our input pipeline. After that, obviously, we want to feed it in and train our model with it, and we're going to cover that in the next video. Thank you for watching, and I will see you in the next one.
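
The masking function described in the transcript can be sketched roughly as follows. This is a minimal sketch, not the exact code from the video: the token IDs (0, 1, and 2 as special tokens, 4 as the mask token) come from the transcript, while the names MASK_TOKEN_ID, MAX_SPECIAL_ID, and mask_prob, and the random demo tensor standing in for real tokenizer output, are assumptions made for illustration.

```python
import torch

# Token IDs as stated in the transcript: 0, 1 and 2 are special tokens,
# and the mask token is 4 in vocab.json. The constant names are hypothetical.
MASK_TOKEN_ID = 4
MAX_SPECIAL_ID = 2


def mlm(tensor: torch.Tensor, mask_prob: float = 0.15) -> torch.Tensor:
    """Replace roughly `mask_prob` of the non-special tokens with the mask ID (in place)."""
    # One uniform random value in [0, 1) per token.
    rand = torch.rand(tensor.shape)
    # Candidate positions: random value under the threshold AND not a special token.
    mask_arr = (rand < mask_prob) & (tensor > MAX_SPECIAL_ID)
    for i in range(tensor.shape[0]):
        # nonzero() gives a column of positions; flatten drops the extra dimension.
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = MASK_TOKEN_ID
    return tensor


# Demo on a random tensor standing in for tokenizer output (an assumption).
torch.manual_seed(0)
labels = torch.randint(5, 100, (4, 16))
labels[:, 0] = 0    # pretend start-of-sequence token
labels[:, -1] = 2   # pretend end-of-sequence token
# Clone first so the labels keep the original, unmasked IDs.
input_ids = mlm(labels.detach().clone())
```

Since mask_arr is a boolean tensor with the same shape as the input, writing tensor[mask_arr] = MASK_TOKEN_ID should give the same result without the loop; the row-by-row version is kept here only to mirror the steps in the video.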

Original Description

The input pipeline of our training process is the more complex part of the entire transformer build. It consists of us taking our raw OSCAR training data, transforming it, and preparing it for Masked-Language Modeling (MLM). Finally, we load our data into a DataLoader ready for training!

Part 1: https://youtu.be/GhGUZrcB-WM
Part 2: https://youtu.be/JIeAB8vvBQo
---
Part 4: https://youtu.be/35Pdoyi6ZoQ

📙 Medium article: https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6
📖 Free link: https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6?sk=9db6224efbd4ec6fd407a80b528e69b0
🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
👾 Discord: https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

This video teaches how to build an MLM training input pipeline using PyTorch and the custom tokenizer trained earlier in the series, and how to create a dataset object and DataLoader to load data into a model for training. The pipeline is designed to feed data to the model for training, and the next video will cover training the model with the pipeline.
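
The dataset object and DataLoader mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions: the class follows the three methods described in the transcript (__init__, __len__, __getitem__), while the small random tensors are placeholders for the real encodings, and the batch size of 16 is simply the value the presenter says he typically uses.

```python
import torch
import torch.utils.data


class Dataset(torch.utils.data.Dataset):
    """Wraps a dict of equally long tensors so a DataLoader can index it."""

    def __init__(self, encodings):
        # Store the encodings dict (input_ids, attention_mask, labels).
        self.encodings = encodings

    def __len__(self):
        # Number of samples is the first dimension of any of the tensors.
        return self.encodings["input_ids"].shape[0]

    def __getitem__(self, i):
        # Return the i-th row of every tensor, keyed by the same names.
        return {key: tensor[i] for key, tensor in self.encodings.items()}


# Placeholder tensors standing in for the real tokenized data (an assumption).
labels = torch.randint(5, 100, (32, 8))
encodings = {
    "input_ids": labels.clone(),
    "attention_mask": torch.ones_like(labels),
    "labels": labels,
}

dataset = Dataset(encodings)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

# Each batch is a dict of tensors with a leading batch dimension.
batch = next(iter(loader))
```

Because __getitem__ loops over whatever keys the dict holds, the class does not need to change if another tensor is added to the encodings later.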

Key Takeaways

Create a random array and use it to select roughly 15% of the tokens in the input tensor for masking
Exclude special tokens (IDs 0, 1, and 2) from masking
Use torch.flatten to remove the outer list and get a list of mask token indices
Create a list of training files using pathlib
Create three lists: input_ids, attention_mask, labels
Use tqdm to create a progress bar while loading the files
Tokenize the text data with padding and truncation to 512 tokens
Extract input ids, attention mask, and labels
Concatenate tensors with torch.cat
Wrap the tensors in a Dataset and load them with a DataLoader
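
The file-loading and concatenation steps in the list above can be sketched as follows. This is only an illustration under loud assumptions: toy_tokenize is a hypothetical stand-in for the tokenizer trained earlier in the series (it emits random IDs, with 1 assumed as the padding ID), the two temporary text files and their contents are invented, the *.txt glob pattern is a guess, and the masking uses a compact boolean-mask assignment rather than the row-by-row loop from the video. The tqdm progress bar is left out to keep the example dependency-free.

```python
import tempfile
from pathlib import Path

import torch


def toy_tokenize(lines, max_length=8):
    """Hypothetical stand-in for the real tokenizer: padded IDs plus attention mask."""
    ids = torch.ones((len(lines), max_length), dtype=torch.long)  # 1 = assumed pad ID
    attention = torch.zeros_like(ids)
    for row, line in enumerate(lines):
        n = min(len(line.split()), max_length)
        ids[row, :n] = torch.randint(5, 100, (n,))  # random IDs, not real vocab
        attention[row, :n] = 1
    return ids, attention


input_ids, mask, labels = [], [], []

with tempfile.TemporaryDirectory() as tmp:
    # Two tiny invented files, one sample per line, mimicking the training data layout.
    Path(tmp, "text_0.txt").write_text("ciao come va\nbene grazie", encoding="utf-8")
    Path(tmp, "text_1.txt").write_text("buongiorno a tutti", encoding="utf-8")
    paths = sorted(Path(tmp).glob("*.txt"))

    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            lines = f.read().split("\n")
        ids, attention = toy_tokenize(lines)
        labels.append(ids)
        mask.append(attention)
        # Clone before masking so the labels keep the original IDs.
        masked = ids.detach().clone()
        rand = torch.rand(masked.shape)
        masked[(rand < 0.15) & (masked > 2)] = 4  # 4 = mask token ID from the video
        input_ids.append(masked)

# torch.cat joins the per-file tensors along the first (sample) dimension.
input_ids = torch.cat(input_ids)
mask = torch.cat(mask)
labels = torch.cat(labels)
```

Appending one tensor per file and concatenating once at the end avoids repeatedly reallocating a growing tensor inside the loop.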

💡 The input pipeline is a critical component of the MLM training process; a trained tokenizer together with PyTorch's Dataset and DataLoader classes simplifies building it.
