Training and Testing an Italian BERT - Transformers From Scratch #4

James Briggs · Advanced · 🧠 Large Language Models · 4y ago

Skills: LLM Foundations 90% · Fine-tuning LLMs 80% · LLM Engineering 60%

Key Takeaways

The video demonstrates training and testing an Italian BERT model from scratch using the Transformers library: configuring a RoBERTa model, initializing it with random weights, and training it with the AdamW optimizer. It also covers testing the model with Italian sentences and verb conjugations, and mentions a follow-up video on uploading the model to the Hugging Face Model Hub.

Full Transcript

Hi, welcome to the video. This is the fourth video in a Transformers From Scratch mini-series. If you haven't been following along, we've essentially covered what you can see on the screen: we got some data, we built a tokenizer with it, and then we set up our input pipeline, ready to begin actually training our model, which is what we're going to cover in this video.

Let's move over to the code. We see here that we have essentially everything we've done so far: we've built our input data and our input pipeline, and we're now at a point where we have a PyTorch DataLoader ready and can begin training a model with it. There are a few things to be aware of. First, let's have a quick look at the structure of our data. When we're training a model for masked-language modeling, we need three tensors (this is for training RoBERTa, by the way; it's the same thing with BERT as well): our input IDs, attention mask, and labels. Our input IDs have roughly 15% of their values masked. We can see that here: we have these two tensors. These are the labels, with the real token IDs in them, and in our input IDs tensor those have been replaced with mask tokens, the number fours. So that's the structure of our input data. We've created a torch Dataset from it and used that to create a torch DataLoader, and with that we can begin setting up our model for training.

We can't just begin training straight away. The first thing we need to do is create a RoBERTa config object. The config object is something we use when we're initializing a transformer from scratch, in order to initialize it with a certain set of parameters. So we'll do that first: from transformers import RobertaConfig. To create the config object we write RobertaConfig, and in here we need to specify different parameters.

One of the main ones is the vocab size. This needs to match whichever vocab size we already created in our tokenizer when building it. For me, if I go all the way up to where I created the tokenizer, I can see it's this number here, 30,522, so I'm going to set that. If you don't have that, you can just write tokenizer.vocab_size, and that will return your vocab size, so let's replace that. As well as that, we also want to set the max position embedding. This needs to be set to your max length plus two in this case. Max length is set up here, 512, plus two because we have these added special tokens. If we don't do that, we'll end up with an index error, because we're going beyond the embedding limits.

Then we want our hidden size. This is the size of the vectors that the embedding layers within RoBERTa will create. So we have 514, or 512, tokens, and each one of those will be assigned a vector of size 768. This is the typical number; it originally came from the BERT base model. Then we set up the architecture of the internals of the model. We want the number of attention heads, which I'm going to set to 12, and also the number of hidden layers. The default for RoBERTa is 12, but I'm going to go with six for the sake of keeping training times a little shorter. We also need to add type vocab size, which is just one. That's the different token types that we have; we just have one, so we don't need to worry about that.

Okay, so that's our configuration object ready, and we can import and initialize a RoBERTa model with it. From transformers (this is kind of similar to what we usually do) import RobertaForMaskedLM, because we're training using MLM. We initialize our model using that RobertaForMaskedLM object and just pass in our config. That initializes our RoBERTa model: a plain RoBERTa model with randomly initialized weights and so on.

Now we can move on to setting up everything for training. We have our model; now we need to prepare a few things before we train it. The first thing is to decide which device we're going to be training on, whether that's CPU or a CUDA-enabled GPU. To figure out if we have that, we can write torch.cuda.is_available(), and for me it is. The typical line of code that will decide it for you is device = torch.device('cuda') if it's available, otherwise torch.device('cpu'). Now, CPU takes a really long time. If you are using CPU, you'll have to leave it overnight for sure, maybe even longer; even if it's just a little bit of data, it takes so long. Hopefully you have a GPU; if not, you're just going to have to be patient. Or you could maybe try Google Colab, but you'd have to use the premium version, because otherwise it's just going to shut off after an hour or two. I don't really use it, so I don't know how long it would train for before deciding it's done, and the GPU is also not that good anyway. So, however you can do it.

After that, we want to move our model to our device; whether it's GPU or CPU, we move it over there. We're going to get a really big output now: this is the structure of our model. We can see a few interesting things. We've got RobertaForMaskedLM, we have the RoBERTa model, and inside that we have our embeddings, and then we have our 12 (did I say 12? I think it was six) encoders. Yeah, it goes from zero to five, so six. Then we have the outputs here, and our final bit, which is the language modeling head, the MLM head. So that's cool.

Now we need our optimizer. From transformers import AdamW, which is Adam with weight decay. I'm also going to activate the training mode of our model. It's going to give us loads of output again, so let's just remove that. There we go, easier. Then our optimizer is going to be AdamW. We need to pass in our model parameters, and we need a learning rate. I don't usually use RoBERTa, but looking online this looks like a reasonable learning rate. I think you can go from sort of here down to, from what I remember, here; that's the typical range. But obviously it's going to depend on how much data you have and loads of different things. So that's what I'm going to go with, and that should be pretty much it.

Now we're just going to create our training loop. For the training loop, we want to import tqdm so we can see how far through we are. We're going to train for two epochs, and we're going to initialize our loop object using tqdm. So tqdm, and we have our data loader. What is the name of that data loader? I'm not sure... dataloader, cool. And we set leave=True. Sorry, I need that in the same cell. So: for batch in loop, and then here we run through each of the steps that we're going to perform for every single training loop. The first thing we do is initialize the gradient in our optimizer, so zero_grad. The reason we do this is that after the first loop our optimizer is going to be assigned a set of gradients, which it's going to use to optimize our model, and on the next loop we don't want those residual gradients to still be there in our optimizer. We want to essentially reset it for the next loop. That's what we're doing here.

Then we want our tensors. We have input_ids, which is going to be batch['input_ids'], and we also want to move that over to our GPU (or CPU, if you're on that). It's pretty much the same for all three: mask, which is just the attention mask, and labels. Okay, so we've extracted our tensors and we just need to feed them into our model now. We get our outputs from the model: model(input_ids, attention_mask=mask, labels=labels). So everything has been fed into our model and we have our outputs. Now we need to extract a few things from the output. We need the loss, so we write loss = outputs.loss. From that we need to calculate the loss for each of the different parameters in our model, so we call loss.backward() to backpropagate through all of those values. After that we use our optimizer to take a step and optimize all those parameters based on that loss. That's everything we need to train the model.

Then just a few things for the progress bar: I want a little bit of information there so I know what's going on. That's loop.set_description, and I just want to print out the epoch, so write that. Then I want to set the postfix as well, so loop.set_postfix, and here I just want to see the loss, so loss=loss.item(). That should be everything. Let's run that and see what happens. Hopefully it should work... nope, didn't work. Let's see. Oh no, it's a CUDA error, so I probably just need to refresh everything. I hate CUDA errors. One moment.

Okay, so I finally figured it out; it took so long. A few tips, anyway: when you do get a CUDA error, switch your device to CPU and rerun everything, and you should get a more understandable error. If we come down here, I've changed it to CPU, and we see that we get an index error: "index out of range in self". You get this error if you don't have the extra two tokens on the end of here, but we added them, so I was pretty confused about that. Then it took me a really long time to realize that this argument is wrong and there should be an s on the end. So that was the error. That was literally it. It took me so long to figure that out, but now we have it, and that's good. We need to run everything again, so I'm just going to run through everything, remove this cell here where I change it to CPU, because I don't need it now, and re-execute that.

Okay, so we're back and we've finished training our model. It has taken a long time; this is a few days later. I made a few changes during training as well, so this definitely wasn't the cleanest training process, because I was updating parameters as it was going along. First, we've trained for three and a bit epochs, and I've trained on the full dataset as well. If I come up here... do I print out how much data it was? Maybe in another file. If we come down here, yeah, there's a lot more data here: 200... no, 20... let me think, 2 million. Okay, so 2 million samples in that final run. When we started training, we started with a learning rate of 1e-5. I looked into this a little bit and it just was not really moving, and I'll show you in a minute. So for the second epoch I moved it up to 1e-4, and that started moving things a lot quicker, so that was good. In total, like I said, it was three and a bit epochs. Other than that I didn't really change anything. The only thing I did was train one epoch at a time, because I wanted to see how the results were looking after each epoch, and that was quite interesting. So let me show you that.

Okay, so this is after the first epoch. What I'm doing here is I've got this fill, which is a fill-mask pipeline object, and I'm entering "ciao", then putting in a mask, and then "va". I wanted it to say "ciao, come va", so right in the middle I want it to predict "come". Now this is after the first epoch, and we can see it's just putting in random characters: a question mark here, three dots here, "ciao" and "ciao" again here. Kind of weird. So, not the best. Then we move on to the second epoch, and it's still rubbish, but at least it's got words, right? Here we have a word: "ciao chi va". I always get the "ch" in Italian messed up; if there are any Italians watching, I'm sorry. At least we're getting words, but it doesn't make any sense. So it's still not good. Now if we come across again, this one: now we get it. The rest of them are kind of nonsense, so the four here, ignore them. However, at the top we get this score of 0.33, and we get "ciao come va". That's what we wanted, so that's good; it means it's working. This was after the third-and-a-bit epoch.

Let me show you the loss function as well. I know this is really messy. Here we have our... I don't know why this one's so short, actually. Strange. It doesn't look like I finished training for the full epoch on the last one. I thought I did; maybe something happened, I'm not sure, but it is what it is. The first set of training I did was here, and you see in the middle my computer went to sleep for a bit overnight; it was just so loud that I turned it off for a bit.
And then it continued going down. This first epoch is when we're at 1e-5, and then here I was testing 1e-4, and you can see straight away it goes down way quicker. So it's like, okay, we're going to go with that; it's clearly a lot better. Then it continued over here for the next epoch, and then the final one here, which didn't seem to change much anyway. But there was a pretty clear difference. So that's the loss over time, and we've seen the results from that.

Now that we have that, let's move on to actually testing the model. I'm going to bring in Laura, and I'm going to open the file. This is the testing we're going to do: we're using fill-mask, we've got this pipeline, and what I'm going to do is get Laura to come in and write some Italian sentences, add this mask token in, and see if the results are bearable or not. So I will see you in a minute.

This is Laura. She can speak Italian, so she's going to go through this and test it a few times and hopefully say it's good. Let's see.

Ciao. Okay, so all you need to do is: we have a sentence here, and you just write some Italian, and then for one of the words in there we want to replace it with this text here. That's going to mask that word, and then the model is going to try and predict what is there. Let's see. So just write some Italian phrases, not too difficult yet. So I don't have to write [unclear]? No, no, just write a sentence. Okay, maybe a few words. Can I put a comma? Yeah. Okay, and then which word should we cover? "Come"? Okay, so just cover it with the mask and see what it says. I need to rerun these as well. Okay, let's give it a moment. Yeah, the second one, "come va", it's almost there. Does "chi va" mean anything? Like "who"? Yeah, it's like "is there someone", but... I understand because I'm Italian, but we don't usually say that. It's fine, I'm going to say that it's good.

So let's do it again. Yeah, try another one. Oh wait, actually, what about these ones? No, no, definitely not. Okay, but just after "buongiorno" I wouldn't expect "chi va". It's okay. So you can just put another one, where we put fill again, in the sentence. And then what do you want to replace? Which one? You decide. It's fine, I think it's interesting. Okay, let's try it. Yeah, that's good. That's quite good, yeah. That's cool. Okay, should we try with "dove", using the same phrase? Okay, press Control-Z. Right, let's run it. And the second one... let's try another one. Okay, let's remove [unclear]. It's grammatically difficult when it's like that, you know. That's very good.

No, but it's good, because "avessimo" is for third person plural, it's like "we had"; "avesse" is third person singular, so "if he had" or "she had". What does this actually mean? "What would have happened if we had chosen another day." So the first one, "se avesse", will be the third person. [unclear] will be the first person, so "if I had chosen another day". [unclear] is a second person plural, so it will be "if they had chosen another day". This one, now... yeah, this is good. [unclear] as well? No, maybe no. But the first three are very good. I have an idea: if we put "se loro", so if we specify the person, maybe it will take the correct one. So if we put "se loro" and then we expect it to say... let's run it. That's cool, that's very good. And the other ones, are they right? I mean, the verb is incorrect, and it's in the wrong place, but the meaning is correct. Yeah, but the grammar is not correct. Okay, all right.

Oh, it's cool. It actually worked. I wasn't sure, because "ciao, come va" worked, but that was all I tested it with, so I was a little bit worried that it might not do anything else. But thank you. You're welcome. Thanks, bye.

Okay, so I think that's a pretty good result. That's pretty much everything we needed for building our transformer model. We're going to do one more video after this, where we're going to upload our model to the Hugging Face model hub, and then we'll be able to download it directly from Hugging Face, which I think will be super cool, and figure out how we actually put all that together. So yeah, good result, pretty happy with that. Thank you for watching, and I will see you again in the next one.
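Early in the transcript, the three training tensors are described: labels holding the real token IDs, input IDs with roughly 15% of values replaced by the mask token (ID 4), and an attention mask. A minimal sketch of that structure might look like the following. The helper name `mask_tokens`, the example token IDs, and the special-token IDs 0, 1, and 2 (assumed here to be `<s>`, `<pad>`, and `</s>`) are illustrative assumptions, not code from the video.

```python
import torch

MASK_ID = 4               # mask token id mentioned in the transcript
SPECIAL_IDS = (0, 1, 2)   # assumed ids for <s>, <pad>, </s>


def mask_tokens(labels, mask_prob=0.15, seed=0):
    """Return a copy of `labels` with ~mask_prob of ordinary tokens masked."""
    gen = torch.Generator().manual_seed(seed)
    rand = torch.rand(labels.shape, generator=gen)
    # Never mask special tokens, only ordinary tokens.
    maskable = torch.ones_like(labels, dtype=torch.bool)
    for special in SPECIAL_IDS:
        maskable &= labels != special
    selected = (rand < mask_prob) & maskable
    input_ids = labels.clone()
    input_ids[selected] = MASK_ID
    return input_ids


# One padded example sequence (made-up token ids).
labels = torch.tensor([[0, 57, 913, 24, 8, 301, 77, 2, 1, 1]])
attention_mask = (labels != 1).long()   # 0 wherever the token is padding
input_ids = mask_tokens(labels)

batch = {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "labels": labels,
}
```

The labels keep every original token, so the model is scored against the unmasked sequence while only seeing the masked one.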

Original Description

We need two things for training, our DataLoader and a model. The DataLoader we have, but no model. For training, we need a raw (not pre-trained) RobertaForMaskedLM. To create that, we first need to create a RoBERTa config object to describe the parameters we'd like to initialize FiliBERTo with. Once we have our model, we set up our training loop and train! Post-training, we'll test the model with Laura, who is Italian, and hope for the best.

Part 1: https://youtu.be/GhGUZrcB-WM
Part 2: https://youtu.be/JIeAB8vvBQo
Part 3: https://youtu.be/heTYbpr9mD8

📙 Medium article: https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6
📖 If membership is too expensive - here's a free link: https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6?sk=9db6224efbd4ec6fd407a80b528e69b0
🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
👾 Discord https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

00:00 Intro
00:35 Review of Code
02:02 Config Object
06:28 Setup For Training
10:30 Training Loop
14:57 Dealing With CUDA Errors
16:17 Training Results
19:52 Loss
21:18 Fill-mask Pipeline For Testing
21:54 Testing With Laura
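The config object and raw model mentioned in the description can be sketched roughly as follows, using the values given in the video (vocab size 30,522, max length 512, hidden size 768, 12 attention heads, 6 hidden layers). The constant names `VOCAB_SIZE` and `MAX_LEN` are assumptions for illustration; in the video the vocab size can also be read from the tokenizer.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

VOCAB_SIZE = 30_522   # must match the tokenizer's vocab size
MAX_LEN = 512         # max sequence length used in the input pipeline

config = RobertaConfig(
    vocab_size=VOCAB_SIZE,
    # Note the trailing "s": the video's index error came from misspelling
    # this argument, so the intended value was never applied.
    max_position_embeddings=MAX_LEN + 2,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=6,   # RoBERTa's default is 12; 6 shortens training
    type_vocab_size=1,
)

# Randomly initialized weights: no pre-trained checkpoint is loaded here.
model = RobertaForMaskedLM(config)
```

Because the model is built from a config rather than with `from_pretrained`, it starts with no knowledge of any language and has to be trained before its predictions mean anything.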


This video teaches how to train and test an Italian BERT model from scratch using the Transformers library. It also covers testing the model with Italian sentences and verb conjugations, and the plan to upload the model to the Hugging Face Model Hub. The key takeaways are the importance of configuring the RoBERTa model correctly and of choosing a learning rate that lets the loss actually fall.
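The video tests the trained model through a fill-mask pipeline. As a rough sketch of what such a pipeline computes, the snippet below finds the mask position, takes the model's logits there, and returns the top-scoring token IDs. It uses a tiny randomly initialized model and made-up token IDs so that it runs without a trained checkpoint, which means the predictions themselves are meaningless; the helper name `top_k_for_mask` is hypothetical.

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM

MASK_ID = 4  # mask token id mentioned in the transcript

# Tiny stand-in model (an assumption for illustration, not the trained one).
config = RobertaConfig(
    vocab_size=100,
    max_position_embeddings=18,
    hidden_size=32,
    num_attention_heads=2,
    num_hidden_layers=1,
    intermediate_size=64,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config).eval()


def top_k_for_mask(model, input_ids, k=5):
    """Return (token_id, score) pairs for the first masked position."""
    mask_pos = (input_ids[0] == MASK_ID).nonzero(as_tuple=True)[0][0]
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits
    # Softmax over the vocabulary turns logits into scores that sum to 1.
    probs = logits[0, mask_pos].softmax(dim=-1)
    scores, token_ids = probs.topk(k)
    return list(zip(token_ids.tolist(), scores.tolist()))


# <s> token MASK token </s>, with made-up ids for the ordinary tokens.
input_ids = torch.tensor([[0, 17, MASK_ID, 42, 2]])
predictions = top_k_for_mask(model, input_ids)
```

With a trained model and its tokenizer, decoding the returned IDs gives the candidate words and scores of the kind shown in the video, such as the 0.33 score for "ciao come va".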

Key Takeaways

Create a data loader from a dataset
Configure the RoBERTa model with a vocab size, max position embeddings (max length plus two), hidden size, number of attention heads, and number of hidden layers
Initialize the RoBERTa model (RobertaForMaskedLM) with the configuration object
Move model to device
Activate training mode
Set up the AdamW optimizer (Adam with weight decay)
Create training loop with tqdm
Train for 2 epochs
Reset the optimizer's gradients with zero_grad at the start of each step
Move input ids, attention mask, and labels to GPU or CPU
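The steps listed above can be sketched as one small, runnable loop. To keep it self-contained, this version makes several assumptions that differ from the video: it trains a tiny model on random token IDs (the hypothetical `RandomMLMDataset`) instead of the Italian data, and it uses `torch.optim.AdamW` in place of the `AdamW` imported from transformers in the video, which newer transformers releases have deprecated. The learning rate of 1e-4 is the value the video settles on.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import RobertaConfig, RobertaForMaskedLM


class RandomMLMDataset(Dataset):
    """Hypothetical stand-in dataset: random token ids, ~15% masked."""

    def __init__(self, n=32, seq_len=16, vocab_size=100, mask_id=4):
        gen = torch.Generator().manual_seed(0)
        self.labels = torch.randint(5, vocab_size, (n, seq_len), generator=gen)
        self.input_ids = self.labels.clone()
        self.input_ids[torch.rand(n, seq_len, generator=gen) < 0.15] = mask_id
        self.attention_mask = torch.ones_like(self.labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {
            "input_ids": self.input_ids[i],
            "attention_mask": self.attention_mask[i],
            "labels": self.labels[i],
        }


# Tiny config so the sketch runs quickly on CPU.
config = RobertaConfig(
    vocab_size=100, max_position_embeddings=16 + 2, hidden_size=32,
    num_attention_heads=2, num_hidden_layers=1, intermediate_size=64,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
model.train()  # activate training mode
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

dataloader = DataLoader(RandomMLMDataset(), batch_size=8, shuffle=True)
epochs = 2
losses = []

for epoch in range(epochs):
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optim.zero_grad()  # clear gradients left over from the last step
        input_ids = batch["input_ids"].to(device)
        mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        outputs = model(input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss
        loss.backward()    # backpropagate to get a gradient per parameter
        optim.step()       # update the parameters
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())
        losses.append(loss.item())
```

Recording `loss.item()` for each step is an addition here; it gives the raw values needed to plot a loss curve like the one discussed in the video.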

💡 The key to successful training of a language model is to configure the model correctly and choose the optimizer and learning rate with care. In the video, a misspelled max position embeddings argument caused an index error, and raising the learning rate from 1e-5 to 1e-4 made the loss fall much faster.
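The "max length plus two" rule and the "index out of range" error from the video can be reproduced with a plain PyTorch embedding layer. The sketch below assumes that RoBERTa numbers positions starting just after its padding index (1 by default), so a full 512-token sequence reaches position 513; the helper `roberta_position_ids` is a simplified illustration of that numbering, not the library's own function.

```python
import torch
from torch import nn

MAX_LEN = 512
PADDING_IDX = 1  # assumed pad token id, RoBERTa's default


def roberta_position_ids(input_ids, padding_idx=PADDING_IDX):
    """Number non-padding tokens from padding_idx + 1 upwards."""
    mask = input_ids.ne(padding_idx).long()
    return torch.cumsum(mask, dim=1) * mask + padding_idx


# A full-length sequence with no padding (token id 10 is arbitrary).
input_ids = torch.full((1, MAX_LEN), 10)
position_ids = roberta_position_ids(input_ids)  # runs from 2 up to 513

too_small = nn.Embedding(MAX_LEN, 8)        # valid indices 0..511
big_enough = nn.Embedding(MAX_LEN + 2, 8)   # valid indices 0..513

try:
    too_small(position_ids)
    error = None
except IndexError as exc:
    # On CPU this is a readable IndexError, which is why the video
    # suggests switching device to CPU when a CUDA error appears.
    error = str(exc)

vectors = big_enough(position_ids)
```

The same lookup on a GPU tends to surface as an opaque CUDA error, which matches the debugging experience described in the video.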


Chapters (10)

0:00 Intro

0:35 Review of Code

2:02 Config Object

6:28 Setup For Training

10:30 Training Loop

14:57 Dealing With CUDA Errors

16:17 Training Results

19:52 Loss

21:18 Fill-mask Pipeline For Testing

21:54 Testing With Laura
