Training BERT #3 - Next Sentence Prediction (NSP)
Key Takeaways
This video explains Next Sentence Prediction (NSP), one of the two objectives used to train BERT alongside masked-language modeling, and shows how to tokenize a sentence pair, create a label, and calculate the NSP loss using Hugging Face Transformers and PyTorch.
Full Transcript
Hi, welcome to the video. Here we're going to have a look at using next sentence prediction, or NSP, for fine-tuning our BERT models. Now, in a few of the previous videos we covered masked-language modeling and how we use masked-language modeling to fine-tune our models. NSP is like the other half of fine-tuning for BERT. Both of those techniques are used during the actual training of BERT, so when Google trained BERT initially, they used both of these methods. And whereas MLM is identifying, or almost training on, the relationships between words, next sentence prediction is training on more long-term relationships between sentences rather than words. In the original paper it was found that without NSP (because they tried training BERT without NSP as well), BERT performed worse on every single metric, so it is pretty important. And obviously, if we take this approach, we take masked-language modeling and NSP and apply both of those to training our models, fine-tuning our models, we're going to get better results than if we just use MLM.
So what is NSP? NSP consists of giving BERT two sentences, sentence A and sentence B, and saying, "Hey BERT, does sentence B come after sentence A?" And then BERT will say, okay, sentence B is the next sentence after sentence A, or it is not the next sentence after sentence A. So if we took these three sentences that are on the screen, we have one, two, and three. If you ask BERT, does sentence two come after sentence one, then we'd kind of want BERT to say no, because clearly they're talking about completely different topics, and the type of language and everything in there just doesn't really match up. But then if we have a look at sentence three and sentence one, they do match up, so sentence three is quite possibly the follow-on sentence after sentence one. In that case we would expect BERT to say this is the next sentence.
So let's have a look at how NSP looks within BERT itself. Here we have the core model, and during fine-tuning or pre-training we add
this other head on top of BERT. This is the BertForPreTraining head, and the BertForPreTraining head contains two different heads inside it: our NSP head and our masked-language modeling head. Now, we just want to focus on the NSP head for now. As well, we don't need to fine-tune or train our models with both of these heads; we can actually do it one by one. We could use masked-language modeling only, or we could use NSP only, but the full approach to pre-training BERT is using both.
So if we have a look inside our NSP head, we'll find that we have a feed-forward neural network, and that will output two different values. These two values are our "is not the next sequence", there, and our "is the next sequence", which is there. So value 0 is the next sentence, and value 1 is not the next sentence. Now, we have the final outputs from our final encoder in BERT at the bottom here, and we don't actually use all of these activations. We only use the [CLS] token activation, which is over at the left here. So this here is our [CLS] token. And when I say this is our [CLS] token, I mean more that this is not our [CLS] token itself: the [CLS] token is down here. So we input the [CLS] token, and this output is the subsequent output after being processed by 12 or so encoders within BERT itself. So this is the output representation of that [CLS] token.
Now the activations from that get fed into our feed-forward neural network. The dimensionality that we have here is 768 for that single token (this is in the BERT base model, by the way), and that gets translated into our dimensionality of two here, which is just the two outputs. So that's essentially how NSP works. Once we have our two outputs here, we just take the argmax of both of those. So we take both over here, and we just take an argmax function of that, and that will output either 0 or 1, where 0 is the IsNext class and 1 is the NotNext class. And that's how NSP works.
So let's dive into the code and see how all this works in Python. Okay, so we're
going to be using Hugging Face's Transformers and PyTorch, so we'll import both of those. From transformers we just need the BertTokenizer class and the BertForNextSentencePrediction class. Then we also want to import torch. We're going to use two sentences here. Both of these are from the Wikipedia page on the American Civil War, and these are consecutive sentences. So, going back to what we looked at before, we would be hoping that BERT would output a zero label for these, because sentence B is the next sentence after sentence A; this one being sentence B, this one being sentence A.
Execute that, and we now have three different steps that we need to take: tokenization; creating a classification label, so the zero or one, so that we can train the model; and then from that we calculate the loss. So the first step there is tokenization. Tokenizing is pretty easy. All we do is write inputs = tokenizer, and then we pass text and text2, and we are using PyTorch here, so I want to return a PyTorch tensor: make sure that's 'pt'. Now we need to also initialize those, so tokenizer = BertTokenizer.from_pretrained, and we'll just use bert-base-uncased for now. Obviously you can use another BERT model if you want. I'm just going to copy that and initialize our model as well. Okay, now rerun that, and we'll get this warning. That's because we're using these models that are used for training or for fine-tuning, so it's just telling us that we shouldn't really use this for inference; you need to train it first. And that's fine, because that's our intention.
Now, from these inputs we'll get a few different tensors: input_ids, token_type_ids, and attention_mask. For next sentence prediction we do need all of these, so this is a little bit different to masked-language modeling. With masked-language modeling we don't actually need token_type_ids, but for next sentence prediction we do. So let's have a look at
what we have inside these. So input_ids is just our tokenized text, and you see that we passed these two sentences here, and they're actually both within the same tensor here, input_ids, and they're separated by this 102 in the middle, which is a separator token. So before that, all these tokens are our text variable, or sentence A, and then afterwards we have our text2 variable, which is sentence B. We can see this mirrored in the token_type_ids tensor as well. All the way along here, up to here, that's our sentence A, so we have zeros for sentence A, and then following that we have ones representing sentence B. And then we have our attention mask, which is just ones, because the attention mask is a one where there's a real token and a zero where we have padding tokens. So we don't need to really worry about that tensor at all.
Now, the next step here is that we need to create a labels tensor. To do that we'll just write labels, and we just need to make sure that when we do this we use a LongTensor. In here we need to pass a list containing a single value, which is either zero for "is the next sentence" or one for "is not the next sentence". In our case, our two sentences are supposed to be together, so we would pass a zero in here. Run that, and if we have a look at what we get from there, we see that we get this integer tensor.
So now we're ready to calculate our loss, which is really easy. We have our model up here, which we have already initialized, so we just take that, and all we do is pass our inputs from here into our model as keyword arguments (that's what these two asterisk symbols are for), and then we also pass labels to the labels parameter. That will output a couple of tensors for us. So we can execute that, and let's have a look at what we have. You see that we get these two tensors: we have the logits, and we also have the loss tensor. So let's have a look at the logits, and we should be able to recognize this
from earlier on, where we saw those two nodes and we had the two values: index zero for IsNext and index one for NotNext. So let's have a look. You can see here that we get both of those: this is our activation for "is the next sentence", and this is our activation for "is not the next sentence". And if we were to take the argmax of those output logits, we get zero, which means it is the next sentence. We also have the loss, and this loss tensor will only be output if we pass our labels here; otherwise we just get the logits tensor. So when we're training, obviously we need labels so that we can calculate the loss. And if we just have a look at that, we'll see it's just a loss value, which is very small because the model is predicting a zero and the label that we've provided is also a zero, so the loss is pretty good there.
So that is how NSP works. Obviously it's slightly different if you're actually training your model, and I am going to cover that in the next video, so I'll leave a link to that in the description. But for now, that's it for this one. Thank you very much for watching, and I'll see you again in the next one.
Original Description
Next sentence prediction (NSP) is one-half of the training process behind the BERT model (the other being masked-language modeling - MLM).
Where MLM teaches BERT to understand relationships between words - NSP teaches BERT to understand relationships between sentences.
In the original BERT paper, it was found that without NSP, BERT performed worse on every single metric - so it's important.
Now, when we use a pre-trained BERT model, training with NSP and MLM has already been done, so why do we need to know about it?
Well, we can actually further pre-train these pre-trained BERT models so that they better understand the language used in our specific use-cases. To do that, we can use both MLM and NSP.
So, in this video, we'll go into depth on what NSP is, how it works, and how we can implement it in code.
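As a rough illustration of how the NSP head works, here is a minimal sketch in plain PyTorch. It assumes the BERT base hidden size of 768 and uses a random tensor as a stand-in for BERT's final-layer output, so the numbers are meaningless; only the shapes and the flow matter. It is a simplification: the Hugging Face implementation also passes the [CLS] vector through a pooling layer before the classifier, which is left out here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 768  # BERT base

# Stand-in for the final encoder output: (batch, sequence length, hidden size).
# In a real model this would come from BERT itself.
sequence_output = torch.randn(1, 16, hidden_size)

# Only the output representation of the [CLS] token (position 0) is used.
cls_output = sequence_output[:, 0]        # shape (1, 768)

# The NSP head maps 768 values to two outputs:
# index 0 = "is next", index 1 = "not next".
nsp_head = nn.Linear(hidden_size, 2)
logits = nsp_head(cls_output)             # shape (1, 2)

# The predicted class is the argmax over the two outputs.
prediction = torch.argmax(logits, dim=-1)

# With a label, the training loss is cross-entropy over the two classes.
labels = torch.tensor([0])                # 0 = sentence B follows sentence A
loss = nn.functional.cross_entropy(logits, labels)

print(logits.shape, prediction, loss)
```

The loss is small when the head puts most of its weight on the labelled class, which is why a correct, confident prediction gives a low loss value.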
Training with NSP:
https://youtu.be/x1lAcT3xl5M
📙 Medium article:
https://towardsdatascience.com/bert-for-next-sentence-prediction-466b67f8226f
🎉 Sign-up For New Articles Every Week on Medium!
https://medium.com/@jamescalam/membership
📖 If membership is too expensive - here's a free link:
https://towardsdatascience.com/bert-for-next-sentence-prediction-466b67f8226f?sk=3595968413abde1c5833e1a96e449673