Building MLM Training Input Pipeline - Transformers From Scratch #3
Key Takeaways
The video builds a masked-language modeling (MLM) training input pipeline in PyTorch. It tokenizes Italian text with the RoBERTa-style tokenizer trained earlier in the series, masks roughly 15% of the tokens, and wraps the resulting tensors in a Dataset and DataLoader ready for training a model from scratch.
Full Transcript
Hi, welcome to this video. This is the third video in our Transformers From Scratch miniseries. In the last two videos we got a load of data and trained our tokenizer, which is what you can see here. I'm going to have to re-run these. Okay, and we just tokenize some Italian. This is an Italian BERT model, a RoBERTa model, so we can write something like this, which means "hello, how are you", and it will tokenize our text here. This is where we've got up to.

What we now want to do is build out an input pipeline and train a model. This is reasonably involved, I would say, because we need to do a few things. First, we need three different tensors in our model: we need the input IDs and attention mask, and we also need the labels tensor. The labels tensor is just going to be our input IDs as they are at the moment. But our input IDs are not going to be that: they need to be passed through a masked-language modeling script, which will mask around 15% of the tokens within that tensor. Then, whilst we're training, our model is essentially going to try to guess what those masked tokens are, and we'll optimize the model using the loss between the guesses that the model outputs from the input IDs and the real values, which are our labels. So that's essentially how it's going to work. I suppose it's a lot easier said than done.

The first thing we're going to do is create our masked-language modeling function. If you have watched some of my videos before, we quite recently did, I think, two videos on masked-language modeling. I'll leave a link to those in the description, because the code is pretty much the same as what we covered there. I will cover it very quickly here, but not too in-depth, so if you're interested, those links will be in the description.

The very first thing we need to do, which I haven't done yet, is import torch. We're using PyTorch here, and we need to create a random array, so
we write torch.rand, and this random array needs to be in the same shape as the tensor that we've input up here, so let's write tensor.shape. Then we want to mask around 15% of those. To do that we can use rand where rand is less than 0.15, because what we've created here is an array in the shape of our input tensor, which is going to be our input IDs, where every value is within the range of zero to one, and that's completely random. That means for each token there's roughly a 15% chance of that value being under 0.15. So that's our first criterion for our random masking array, the roughly 15%. You can call this the mask array as well.

But there are a few other criteria as well, and again, I covered those in the other videos. In short, we don't want to mask special tokens. We can see up here we have two special tokens. If we add padding (I'm just going to add a little bit of padding, not loads; let's go max_length=10 and write padding='max_length'), we get these extra ones here; those are padding tokens. So we basically say we don't want to mask our zeros, twos, or ones, because they're special tokens that we don't want to mask. So we also put where tensor is not equal to zero (let's just copy that, it's a little bit easier), and also not equal to one, and not equal to two.

I do wonder if we could make that a bit nicer. You can either do this, where you specify each token, and you will want to do that sometimes, maybe because your special tokens are in the range of, like, 100, 101, so there are a few different ones. But because everything we've got is 2 or below, we could just write this: where tensor is greater than two. This is like an "and" statement saying we're going to mask tokens that have a randomly generated value of less than 0.15 (that's our 15% criterion) and they're not a
special token, i.e. they are greater than the value 2, because our special tokens are 0, 1, and 2. So that's cool.

Now what we want to do is loop through each row in our tensor, so we want to do for i in range(tensor.shape[0]); this is how many rows we have in our tensor. We can't do this in parallel, because each row is going to have a different number of tokens that will be masked. If we did this in parallel, we'd end up trying to fit different-size rows into an equally sized tensor, so we can't do that. Again, if this is confusing, I have those videos, but you don't need to know specifically everything that's going on here; this is just how we mask roughly 15% of tokens.

So we want torch.flatten, and this is a bit confusing, but we want to take the mask array at the current position and say where it's not zero. When we create this mask array, we essentially get a load of true or false values in the size of our tensor shape; where we have ones, that is a mask. What we're doing here is saying: get me a list of all the values that are not zero, i.e. they're ones. That gives us something like a list within a list, so we get something like this, and it will say indices 2, 5, 18; they are where your mask tokens will be. Then we use torch.flatten here to remove that outer list, and at the end we convert it to a list so that we can do some fancy indexing in a moment.

That fancy indexing looks like this: we have our tensor, we specify the current row because we're going one row at a time, and then we specify the selected indices, which are where we're going to place our mask. Now, what does the mask token look like? We can actually find it over here in our vocab.json. Scroll to the top and we see our mappings here. The mask token is number four, so that's what we're
going to use. Switch back over, and we're going to make those values equal to 4; that's our mask. At that point we have successfully masked our input IDs, and we want to return the tensor. So that's our masking function. That's a big part of this video, and one of the harder parts.

Now I'm going to scroll up a little bit to here. I'm just going to take this, which will give us a list of all of our training files, and we just need to do from pathlib import Path. Okay, let's have a look at what we have. This is just a list of everything that we have over here. These are text files containing our Italian samples; each sample is separated by a newline character, and each file contains something like 10,000 samples, so we have quite a bit of data.

What we're going to do here is create the three tensors that I mentioned before. We have the labels and input IDs, and then we also have the attention mask as well. Let's first initialize those as lists: input_ids, attention mask (I'm calling it mask), and labels. We're also going to use a progress bar here, so from tqdm.auto import tqdm; I'm just going to import that as well.

What I'm going to do is loop through each path in our paths, wrapped in tqdm; this creates our progress bar. For each path we're going to load it, extract all the data, convert it into the correct format that we need, and append each one of those to these lists, and then create a big tensor out of that. So we want to write with open, and then here we have our path, we're reading, and the encoding is utf-8, as f. We want to write text equals f.read(), then .split, like that, and I'm going to call it lines. This is just a big list of 10,000 samples that are all Italian. Then we want to encode that, so we write sample = tokenizer(lines, with our max
length, which is going to be 512. We want padding up to that max length, and we also want to truncate anything that is longer than that, so truncation=True. Okay, that's our tokenization done.

Then we want to extract all of those and add them to our lists. We get our labels first. The labels are just the input IDs produced by our sample, so sample input_ids, and, I'm thinking, here we can do return_tensors to use PyTorch. So we append our input IDs to labels. Then we have our mask: we want to append the sample attention_mask. We can also see that up here, by the way; this is what we're doing, taking those out and putting them into our lists.

So we have labels and masks, and now we're going to create our input IDs. The input IDs are what we built this masked-language modeling function for, and in there we need to pass our tensor. To do that we just want to write sample input_ids, and, before I forget, that needs to go within mlm, like that. Now, I don't want to modify that tensor, because it's been appended to labels, so I'm going to create a clone of it, and that will be done using detach and .clone, like that. So that's pretty good; let's run that.

Okay, and it's going to take a long time, so I'm not going to use all of them. It was going up as well, so I have no idea how long it would take. Let's go with the first 50 for now. We've still got to wait a little while, but at least not as long. I'll leave that to run; hopefully it shouldn't take too long, and I'll see you when it's done.

Okay, so that's done; it wasn't too long. If we just have a look, input_ids at the moment is just a big list. I don't know if it's a good idea, but here we go: we just have a list of tensors. Rather than having lists of tensors, we can use something called torch.cat. torch.cat expects a list of tensors to be passed to it, which is why I've done this, but we have lists and
we just append tensors to them, and we can do that, and it will concatenate our tensors, which is pretty cool. So what we want to do now is write input_ids, and we're just going to concatenate all of our tensors so they're ready for formatting into a dataset. We have mask here and labels here. It's also worth pointing out that we have that mask token there, so we know that we have mask tokens in our input IDs. Let's run that and let's just compare. Let's go input_ids[0]; that's quite a lot, so let's take the first ten. Then let's do the same for labels, and we'll see that we don't have these fours, or hopefully we shouldn't. Okay, so that's essentially the masking operation: this is covered with a mask here, and the same here, and here, here, and here. Okay, cool.

Now, the format that our dataset needs and our model needs is a dictionary where we have input_ids, which maps to input_ids, obviously, and you can guess the other two as well: attention_mask to mask, and the final one is labels. So they're our encodings.

Now we create a dataset object. In fact, we create a dataset object in order to create a DataLoader object, which is what we use to load data into our model, and that's essentially our input pipeline. But to create that DataLoader we need to create a dataset object. We create the dataset object like this: we do class Dataset (call it whatever you want), and we want torch.utils.data.Dataset, like that. We need an initialization function, which is going to store our encodings internally; don't forget the def there. So we want to write self.encodings = encodings. This is initializing our dataset object.

Then there are two other methods that this object needs. We need a length method, so that we can say len(dataset) and it will return the number of samples that are in the dataset. We also need a get-item method, which will allow the DataLoader to extract a certain sample. So, say, if it
says, you know, give me number one, it's going to go into this dataset object and extract the tensors, the input IDs, attention mask, and labels, at position one. So that's what we need to do there.

We'll do length first. For length we don't need to pass anything in; we're just calling it len. From that we just want to return self.encodings input_ids, and remember, before we did this shape and took the first one; that was the length. If I take input_ids (I'll copy that) and go here, we get that 500k, which is the number of samples we have. That's what we want to return. Okay, so that's our len.

Then we also have the get-item method. Here we do want to pass an index value, so this is going to be the DataLoader requesting a certain position, and for that we're going to return a dictionary. It needs to be in this format here, but we need to specify the correct index. What we could do is self.encodings, and then access our input_ids like that. (I also need to change that here, or it will give us an error: dot shape.) We could do that, so we could take that, like so, and then just say index position. That's fine; you can do that if you want. But there's an easier way of doing it, where we don't care about the structure of the dataset, we just want to get it out, and we don't need to specify it. We can just do this: key: tensor[i], so the specific index of that tensor, for key, tensor in self.encodings.items(). If we were to go encodings.items(), we can do that here; see, we get essentially everything in our dataset. So we're just looping through that, returning it, and specifying which index we're returning here.

Once we have written that, we can initialize our dataset, so we write dataset = Dataset and then we just pass in our encodings there. So let's remove that, and encodings, that's it. So that's our dataset, and now we
initialize our DataLoader. This is pretty much it for our input pipeline. So, DataLoader, which is torch.utils.data.DataLoader; it's coming from the same area as our Dataset. We pass in our dataset object, and we want to specify a batch size. I typically go with 16; this will depend on how much your compute can handle at once, so just play around with that and see what works. We also want to shuffle our dataset as well. So that's our input pipeline.

After that, obviously, we want to feed it in and train our model with it, and we're going to cover that in the next video. So thank you for watching, and I will see you in the next one.
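
The masking function described in the transcript can be sketched roughly as follows. This is a minimal sketch, not the exact code from the video: the function name mlm, the special-token IDs 0, 1 and 2, and the mask-token ID 4 come from the transcript, while the keyword arguments (mask_token_id, mask_prob) and the small example batch are assumptions added for illustration.

```python
import torch

def mlm(tensor, mask_token_id=4, mask_prob=0.15):
    """Replace roughly `mask_prob` of the non-special tokens with the mask ID.

    Modifies `tensor` in place and returns it, so pass in a clone if the
    original values are still needed (for example as labels).
    """
    # One random value in [0, 1) per token position
    rand = torch.rand(tensor.shape)
    # Mask where the random value is under the threshold AND the token is
    # not a special token (IDs 0, 1 and 2 in the vocabulary from the video)
    mask_arr = (rand < mask_prob) & (tensor > 2)
    # Go one row at a time, as in the video
    for i in range(tensor.shape[0]):
        # Column indices selected for masking in row i
        selection = torch.flatten(mask_arr[i].nonzero()).tolist()
        tensor[i, selection] = mask_token_id
    return tensor

# Illustrative batch (made-up token IDs): 0 and 2 wrap each sequence,
# 1 is padding
input_ids = torch.tensor([
    [0, 25, 310, 48, 771, 2, 1, 1],
    [0, 90, 17, 2, 1, 1, 1, 1],
])
labels = input_ids.detach().clone()       # unmasked targets
masked = mlm(input_ids.detach().clone())  # masked model inputs
```

Because the function writes into the tensor it receives, the clone passed to mlm keeps labels untouched, which is the reason for the detach and clone step in the video. In current PyTorch, a boolean-mask assignment such as tensor[mask_arr] = mask_token_id should give the same result without the row loop; the loop is kept here to mirror the walkthrough.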
Original Description
The input pipeline of our training process is the more complex part of the entire transformer build. It consists of us taking our raw OSCAR training data, transforming it, and preparing it for Masked-Language Modeling (MLM). Finally, we load our data into a DataLoader ready for training!
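
The final step, wrapping the prepared tensors in a Dataset and loading them through a DataLoader, might look roughly like this. It is a minimal sketch under stated assumptions: the class layout, the dictionary keys, the batch size of 16 and the shuffling follow the transcript, while the randomly generated placeholder tensors stand in for the real tokenized OSCAR data and are purely illustrative.

```python
import torch

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        # Dict of equally long tensors: input_ids, attention_mask, labels
        self.encodings = encodings

    def __len__(self):
        # Number of samples = number of rows in any of the tensors
        return self.encodings['input_ids'].shape[0]

    def __getitem__(self, i):
        # Return row i of every tensor, whatever keys the dict holds
        return {key: tensor[i] for key, tensor in self.encodings.items()}

# Placeholder data (assumption): 40 samples of 8 token IDs each
input_ids = torch.randint(5, 100, (40, 8))
encodings = {
    'input_ids': input_ids,
    'attention_mask': torch.ones_like(input_ids),
    'labels': input_ids.clone(),
}

dataset = Dataset(encodings)
loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)

# Each batch is a dict of tensors with the batch dimension first
batch = next(iter(loader))
```

The dictionary comprehension in __getitem__ means the class does not need to know which keys the encodings hold, so the same class would still work if another tensor were added later.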
Part 1: https://youtu.be/GhGUZrcB-WM
Part 2: https://youtu.be/JIeAB8vvBQo
---
Part 4: https://youtu.be/35Pdoyi6ZoQ
📙 Medium article:
https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6
📖 Free link:
https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6?sk=9db6224efbd4ec6fd407a80b528e69b0
🤖 70% Discount on the NLP With Transformers in Python course:
https://bit.ly/3DFvvY5
👾 Discord
https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery:
https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff
Playlist
Uploads from James Briggs · James Briggs · 46 of 60