Training BERT #3 - Next Sentence Prediction (NSP)

James Briggs · Advanced ·🧠 Large Language Models ·5y ago

Key Takeaways

This video introduces Next Sentence Prediction (NSP), one of the two objectives used to pre-train BERT, and walks through a single NSP forward pass with Hugging Face Transformers and PyTorch: tokenizing a sentence pair, creating a label, and reading the logits and loss.

Full Transcript

Hi, welcome to the video. Here we're going to have a look at using next sentence prediction, or NSP, for fine-tuning our BERT models. In a few of the previous videos we covered masked-language modeling (MLM) and how we use it to fine-tune our models. NSP is like the other half of fine-tuning for BERT. Both of those techniques are used during the actual training of BERT: when Google trained BERT initially, they used both of these methods. Whereas MLM is training on the relationships between words, next sentence prediction is training on the longer-term relationships between sentences rather than words. In the original paper they also tried training BERT without NSP, and found that BERT performed worse on every single metric, so it is pretty important. If we take masked-language modeling and NSP and apply both to fine-tuning our models, we're going to get better results than if we just use MLM.

So what is NSP? NSP consists of giving BERT two sentences, sentence A and sentence B, and asking BERT: does sentence B come after sentence A? BERT will then say either that sentence B is the next sentence after sentence A, or that it is not the next sentence after sentence A.

So if we took these three sentences that are on the screen, we have one, two and three. If you ask BERT whether sentence two comes after sentence one, we'd want BERT to say no, because clearly they're talking about completely different topics, and the type of language and everything in there just doesn't match up. But if we have a look at sentence three and sentence one, they do match up, so sentence three is quite possibly the follow-on sentence after sentence one. In that case we would expect BERT to say this is the next sentence.

So let's have a look at how NSP looks within BERT itself. Here we have the core model, and during fine-tuning or pre-training we add this other head on top of BERT. This is the BertForPreTraining head, and it contains two different heads inside it: our NSP head and our masked-language modeling head. We just want to focus on the NSP head for now. We also don't need to fine-tune or train our models with both of these heads; we can do it one by one. We could use masked-language modeling only, or we could use NSP only, but the full approach to pre-training BERT uses both.

If we have a look inside our NSP head, we'll find that we have a feed-forward neural network that outputs two different values. These two values are our "is not the next sequence" here and our "is the next sequence" there. So value 0 is "is the next sentence" and value 1 is "is not the next sentence".

Now, we have the final outputs from the final encoder in BERT at the bottom here, and we don't actually use all of these activations. We only use the [CLS] token activation, which is over at the left here. When I say this is our [CLS] token, I mean more that this is not the [CLS] token itself: the [CLS] token is down here at the input, and this is the output after it has been processed by the 12 or so encoders within BERT. So this is the output representation of that [CLS] token. The activations from that get fed into our feed-forward neural network. The dimensionality we have here is 768 for that single token (this is in the BERT base model, by the way), and that gets translated into a dimensionality of two here, which is just the two outputs.

So that's essentially how NSP works. Once we have our two outputs, we just take the argmax of both of those, and that will output either 0 or 1, where 0 is the IsNext class and 1 is the NotNext class. So let's dive into the code and see how all this works in Python.

Okay, so we're going to be using Hugging Face's Transformers and PyTorch, so we'll import both of those. From transformers we just need the BertTokenizer class and the BertForNextSentencePrediction class. Then we also want to import torch. We're going to use two sentences here; both are from the Wikipedia page on the American Civil War, and they are consecutive sentences. So, going back to what we looked at before, we would be hoping that BERT outputs a zero label for this pair, because sentence B is the next sentence after sentence A. This one is sentence B, and this one is sentence A.

Execute that, and we now have three different steps that we need to take: tokenization; creating a classification label, so the 0 or 1, so that we can train the model; and then from that we calculate the loss.

The first step is tokenization, which is pretty easy. All we do is write inputs equals tokenizer, and then we pass text and text2. We are using PyTorch here, so I want to return a PyTorch tensor; make sure that's set to pt. Now we also need to initialize those, so tokenizer equals BertTokenizer.from_pretrained, and we'll just use bert-base-uncased for now. Obviously you can use another BERT model if you want. I'm just going to copy that and initialize our model as well.

Now rerun that and we'll get this warning. That's because we're using these models that are meant for training or fine-tuning, so it's just telling us that we shouldn't really use this for inference; you need to train it first. That's fine, because that's our intention.

From these inputs we get a few different tensors: input_ids, token_type_ids and attention_mask. For next sentence prediction we do need all of these, which is a little bit different to masked-language modeling. With masked-language modeling we don't actually need token_type_ids, but for next sentence prediction we do.

So let's have a look at what we have inside these. input_ids is just our tokenized text. You can see that we passed these two sentences, and they're actually both within the same tensor here, input_ids, separated by this 102 in the middle, which is the separator token. Before that, all these tokens are our text variable, or sentence A, and afterwards we have our text2 variable, which is sentence B. We can see this mirrored in the token_type_ids tensor as well: all the way along up to here is our sentence A, so we have zeros for sentence A, and following that we have ones representing sentence B. Then we have our attention mask, which is just ones, because the attention mask is a 1 where there is a real token and a 0 where we have padding tokens. So we don't really need to worry about that tensor at all.

The next step is to create a labels tensor. To do that we'll just write labels, and we need to make sure that we use a LongTensor. In here we pass a list containing a single value, which is either 0 for "is the next sentence" or 1 for "is not the next sentence". In our case the two sentences are supposed to be together, so we pass a 0 in here and run that. If we have a look at what we get from there, we see that we get this integer tensor.

So now we're ready to calculate our loss, which is really easy. We have our model up here, which we have already initialized, so we just take that and pass our inputs into the model as keyword arguments; that's what these two asterisks are for. Then we also pass labels to the labels parameter. That will output a couple of tensors for us, so we can execute that and have a look at what we have. You see that we get two tensors: the logits and the loss.

Let's have a look at the logits. We should be able to recognize this from earlier on, where we saw those two nodes and had the two values: index 0 for IsNext and index 1 for NotNext. You can see here that we get both of those: this is our activation for "is the next sentence" and this is our activation for "is not the next sentence". If we take the argmax of those output logits, we get 0, which means it is the next sentence.

We also have the loss. This loss tensor will only be output if we pass our labels here; otherwise we just get the logits tensor. When we're training, obviously we need labels so that we can calculate the loss. If we have a look at that, we'll see it's just a loss value, which is very small because the model is predicting a 0 and the label that we've provided is also a 0, so the loss is pretty good there.

So that is how NSP works. Obviously it's slightly different if you're actually training your model, and I am going to cover that in the next video, so I'll leave a link to that in the description. But for now that's it, so thank you very much for watching, and I'll see you again in the next one.

Original Description

Next sentence prediction (NSP) is one half of the training process behind the BERT model (the other being masked-language modeling, MLM). Where MLM teaches BERT to understand relationships between words, NSP teaches BERT to understand relationships between sentences. In the original BERT paper, it was found that without NSP, BERT performed worse on every single metric, so it's important.

Now, when we use a pre-trained BERT model, training with NSP and MLM has already been done, so why do we need to know about it? Well, we can actually further pre-train these pre-trained BERT models so that they better understand the language used in our specific use-cases. To do that, we can use both MLM and NSP. So, in this video, we'll go into depth on what NSP is, how it works, and how we can implement it in code.

Training with NSP: https://youtu.be/x1lAcT3xl5M
🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
📙 Medium article: https://towardsdatascience.com/bert-for-next-sentence-prediction-466b67f8226f
🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership
📖 If membership is too expensive - here's a free link: https://towardsdatascience.com/bert-for-next-sentence-prediction-466b67f8226f?sk=3595968413abde1c5833e1a96e449673
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

This video explains how Next Sentence Prediction (NSP) is used to train BERT models and why it matters for learning relationships between sentences. It walks through a single NSP forward pass with Hugging Face Transformers and PyTorch; training a model with NSP is covered in the next video of the series.

Key Takeaways
  1. Initialize the tokenizer and model with Hugging Face Transformers (BertTokenizer and BertForNextSentencePrediction)
  2. Tokenize the sentence pair and return PyTorch tensors (input_ids, token_type_ids, attention_mask)
  3. Create a classification label as a LongTensor: 0 if sentence B follows sentence A, 1 if it does not
  4. Pass the inputs and the labels to the model to get the logits and loss tensors
  5. Take the argmax of the logits to get the predicted class (0 = is next, 1 = not next)
💡 NSP is a crucial component of BERT training, allowing the model to understand relationships between sentences, and its absence can significantly impact the model's performance.
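
To make the loss and argmax steps concrete, here is a small pure-Python sketch of what happens to the two NSP logits. It assumes the loss is the standard softmax cross-entropy over the two classes, which is what Hugging Face's BertForNextSentencePrediction computes when labels are passed. The logit values and the function name nsp_loss_and_prediction are made up for illustration.

```python
import math


def nsp_loss_and_prediction(logits, label):
    """Return (cross-entropy loss, argmax class) for one pair of NSP logits.

    logits: [score_is_next, score_not_next]
    label:  0 (is next) or 1 (not next)
    """
    # Softmax, with the max subtracted for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Cross-entropy is the negative log-probability of the true class.
    loss = -math.log(probs[label])
    # Argmax: index of the larger logit (0 = is next, 1 = not next).
    prediction = max(range(len(logits)), key=lambda i: logits[i])
    return loss, prediction


# Hypothetical logits that strongly favour class 0 (is next).
loss, prediction = nsp_loss_and_prediction([6.0, -6.0], label=0)
wrong_loss, _ = nsp_loss_and_prediction([6.0, -6.0], label=1)
print(prediction, loss, wrong_loss)
```

When the model is confident and the label agrees with its prediction, the loss is close to zero, which matches the very small loss seen in the video. The same logits scored against the opposite label give a much larger loss.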
