Training BERT #3 - Next Sentence Prediction (NSP)

James Briggs · Advanced · 🧠 Large Language Models · 5y ago

Skills: LLM Foundations 90% · Fine-tuning LLMs 80% · Prompt Craft 60%

Key Takeaways

This video explains how Next Sentence Prediction (NSP) is used when training and fine-tuning BERT models, and walks through a single NSP forward pass and loss calculation using Hugging Face Transformers and PyTorch.

Full Transcript

Hi, welcome to the video. Here we're going to have a look at using next sentence prediction, or NSP, for fine-tuning our BERT models. In a few of the previous videos we covered masked language modeling (MLM) and how we use it to fine-tune our models. NSP is like the other half of fine-tuning for BERT. Both of those techniques are used during the actual training of BERT: when Google trained BERT initially, they used both of these methods. Whereas MLM is training on the relationships between words, next sentence prediction is training on longer-term relationships between sentences rather than words. In the original paper they also tried training BERT without NSP, and it performed worse on every single metric, so it is pretty important. If we take both masked language modeling and NSP and apply them to fine-tuning our models, we're going to get better results than if we just use MLM.

So what is NSP? NSP consists of giving BERT two sentences, sentence A and sentence B, and asking: does sentence B come after sentence A? BERT will then say either that sentence B is the next sentence after sentence A, or that it is not. If we take these three sentences that are on the screen, we have one, two and three. If you ask BERT whether sentence two comes after sentence one, we'd want BERT to say no, because clearly they're talking about completely different topics, and the type of language in them just doesn't match up. But if we look at sentence three and sentence one, they do match up, so sentence three is quite possibly the follow-on sentence after sentence one. In that case we would expect BERT to say this is the next sentence.

Let's have a look at how NSP looks within BERT itself. Here we have the core BERT model, and during fine-tuning or pre-training we add this other head on top of BERT. This is the BertForPreTraining head, and it contains two different heads inside it: our NSP head and our masked language modeling head. We just want to focus on the NSP head for now. We also don't need to fine-tune or train our models with both of these heads; we can do it one by one. We could use masked language modeling only, or we could use NSP only, but the full approach to pre-training BERT uses both.

If we have a look inside our NSP head, we'll find a feed-forward neural network that outputs two different values. These two values are "is the next sentence" and "is not the next sentence": value 0 is the next sentence, value 1 is not the next sentence. At the bottom here we have the final outputs from the final encoder in BERT. We don't actually use all of these activations; we only use the CLS token activation, which is over on the left here. When I say this is our CLS token, I mean it is not the CLS token itself. The CLS token is down here at the input. We input the CLS token, and this is the corresponding output after being processed by the 12 or so encoders within BERT itself, so it is the output representation of that CLS token. The activations from that get fed into our feed-forward neural network. The dimensionality we have here is 768 for that single token (this is in the BERT base model, by the way), and that gets translated into our dimensionality of two here, which is just the two outputs. Once we have our two outputs, we take the argmax of them, and that will output either 0 or 1, where 0 is the "is next" class and 1 is the "not next" class. That's how NSP works.

So let's dive into the code and see how all this works in Python. We're going to be using Hugging Face's Transformers and PyTorch, so we'll import both of those. From transformers we just need the BertTokenizer class and the BertForNextSentencePrediction class. Then we also want to import torch. We're going to use two sentences here. Both of these are from the Wikipedia page on the American Civil War, and they are consecutive sentences. Going back to what we looked at before, we would be hoping that BERT outputs a zero label for this pair, because sentence B is the next sentence after sentence A.

We now have three different steps that we need to take: tokenization, creating a classification label (the zero or one) so that we can train the model, and then calculating the loss from that.

The first step is tokenization, which is pretty easy. All we do is write inputs equals tokenizer, and then we pass text and text2. We are using PyTorch here, so I want to return a PyTorch tensor; make sure return_tensors is "pt". We also need to initialize the tokenizer and the model, so tokenizer equals BertTokenizer.from_pretrained, and we'll just use bert-base-uncased for now. Obviously you can use another BERT model if you want. I'm going to copy that and initialize our model as well. Now rerun that, and we'll get this warning. That's because we're using these models that are used for training or fine-tuning, so it's just telling us that we shouldn't really use this for inference; you need to train it first. That's fine, because that's our intention.

From these inputs we get a few different tensors: input_ids, token_type_ids and attention_mask. For next sentence prediction we do need all of these, which is a little bit different to masked language modeling. With masked language modeling we don't actually need token_type_ids, but for next sentence prediction we do. Let's have a look at what we have inside these. input_ids is just our tokenized text. You can see that we passed these two sentences, and they're both within the same tensor here, separated by this 102 in the middle, which is the separator token. All of the tokens before that are our text variable, or sentence A, and afterwards we have our text2 variable, which is sentence B. We can see this mirrored in the token_type_ids tensor as well: all the way along up to here we have zeros for sentence A, and following that we have ones representing sentence B. Then we have our attention mask, which is just ones, because the attention mask is a one where there is a real token and a zero where we have padding tokens. We don't really need to worry about that tensor at all.

The next step is to create a labels tensor. To do that we write labels, and we need to make sure that we use a LongTensor. In here we pass a list containing a single value, which is either zero for "is the next sentence" or one for "is not the next sentence". In our case the two sentences are supposed to be together, so we pass a zero in here and run that. If we have a look at what we get, we see this integer tensor.

Now we're ready to calculate our loss, which is really easy. We have our model up here, which we have already initialized, so we take that and pass our inputs into the model as keyword arguments (that's what these two asterisk symbols are for), and then we also pass labels to the labels parameter. That will output a couple of tensors for us, so we can execute that and have a look at what we have. You see that we get two tensors: the logits and the loss.

Let's have a look at the logits. We should be able to recognize this from earlier on, where we saw those two nodes and had the two values, index zero for "is next" and index one for "is not next". You can see here that we get both of those: this is our activation for "is the next sentence" and this is our activation for "is not the next sentence". If we take the argmax of those output logits, we get zero, which means it is the next sentence.

We also have the loss. This loss tensor will only be output if we pass our labels here; otherwise we just get the logits tensor. When we're training we obviously need labels so that we can calculate the loss. If we have a look at it, we'll see it's just a loss value, which is very small because the model is predicting a zero and the label that we've provided is also a zero, so the loss is pretty good there.

So that is how NSP works. Obviously it's slightly different if you're actually training your model, and I am going to cover that in the next video, so I'll leave a link to that in the description. But for now that's it for this one. Thank you very much for watching, and I'll see you again in the next one.
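The three steps from the walkthrough (tokenize, build a label, compute the loss) can be sketched roughly as follows. This is a stand-in under stated assumptions rather than the code from the video: to stay runnable without downloading a checkpoint, the model is a small, randomly initialised BertForNextSentencePrediction built from a made-up BertConfig, and the input tensors are written by hand to mimic what the tokenizer returns for a sentence pair. Apart from 101 for [CLS] and 102 for [SEP], the token IDs are arbitrary placeholders. The commented lines show the from_pretrained calls that the transcript describes.

```python
import torch
from transformers import BertConfig, BertForNextSentencePrediction

# In the video, a pretrained tokenizer and model are loaded instead:
#   tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
#   model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
#   inputs = tokenizer(text, text2, return_tensors="pt")
# Assumption: a tiny random model keeps this sketch offline and fast.
torch.manual_seed(0)
config = BertConfig(
    vocab_size=200,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertForNextSentencePrediction(config)
model.eval()  # disable dropout so repeated runs give the same output

# Hand-written stand-in for tokenizer output: [CLS] A A A [SEP] B B [SEP]
inputs = {
    # 101 = [CLS], 102 = [SEP]; the other IDs are placeholders
    "input_ids": torch.tensor([[101, 11, 12, 13, 102, 21, 22, 102]]),
    # 0 marks sentence A (including its [SEP]), 1 marks sentence B
    "token_type_ids": torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]]),
    # 1 for real tokens; there is no padding in this example
    "attention_mask": torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1]]),
}

# Label 0 means "sentence B is the next sentence", 1 means it is not
labels = torch.LongTensor([0])

with torch.no_grad():
    outputs = model(**inputs, labels=labels)

logits = outputs.logits                    # shape (1, 2)
loss = outputs.loss                        # only present because labels were passed
prediction = torch.argmax(logits, dim=-1)  # index of the larger logit
```

With random weights the prediction itself is meaningless; only the call pattern and tensor shapes carry over. With the pretrained checkpoint and two consecutive sentences, the transcript reports an argmax of 0 and a small loss.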

Original Description

Next sentence prediction (NSP) is one half of the training process behind the BERT model (the other being masked-language modeling, MLM). Where MLM teaches BERT to understand relationships between words, NSP teaches BERT to understand relationships between sentences. In the original BERT paper, it was found that without NSP, BERT performed worse on every single metric, so it's important.

Now, when we use a pre-trained BERT model, training with NSP and MLM has already been done, so why do we need to know about it? Well, we can actually further pre-train these pre-trained BERT models so that they better understand the language used in our specific use-cases. To do that, we can use both MLM and NSP. So, in this video, we'll go into depth on what NSP is, how it works, and how we can implement it in code.

Training with NSP: https://youtu.be/x1lAcT3xl5M
🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
📙 Medium article: https://towardsdatascience.com/bert-for-next-sentence-prediction-466b67f8226f
🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership
📖 If membership is too expensive - here's a free link: https://towardsdatascience.com/bert-for-next-sentence-prediction-466b67f8226f?sk=3595968413abde1c5833e1a96e449673
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

This video teaches how Next Sentence Prediction (NSP) is used to train and fine-tune BERT models and why it matters for learning relationships between sentences. It walks through the NSP head inside BERT and then shows, with Hugging Face Transformers and PyTorch, how to tokenize a sentence pair, create a label, and compute the logits and loss.

Key Takeaways

Initialize the tokenizer and model using Hugging Face Transformers (with PyTorch)
Tokenize the two sentences together and return PyTorch tensors
Create a classification label for the NSP task: 0 if sentence B follows sentence A, 1 if it does not
Pass the tokenized inputs and the label to the model to get the logits and loss tensors
Use argmax on the logits to get the predicted class (0 = is next, 1 = not next)
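As a rough illustration of the last few takeaways, the NSP head described in the video can be sketched in plain PyTorch: a linear layer maps the 768-dimensional [CLS] output to two logits, the argmax picks the class, and cross-entropy against the label gives the loss. The names nsp_head and cls_output are hypothetical, the random vector stands in for what BERT would actually produce at the [CLS] position, and using a single nn.Linear is an assumption about the head's shape rather than a copy of any library's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

hidden_size = 768  # BERT base hidden size, as stated in the video
num_classes = 2    # index 0 = is next, index 1 = not next

# Hypothetical stand-in for the NSP head: 768 -> 2
nsp_head = nn.Linear(hidden_size, num_classes)

# Hypothetical stand-in for BERT's output at the [CLS] position (batch of 1)
cls_output = torch.randn(1, hidden_size)

logits = nsp_head(cls_output)              # shape (1, 2)
prediction = torch.argmax(logits, dim=-1)  # 0 or 1

# Label 0: sentence B follows sentence A
label = torch.tensor([0])
loss = F.cross_entropy(logits, label)      # scalar loss for this one pair
```

The loss here is the negative log-probability that softmax assigns to the labelled class, which is why it is small when the model already predicts the label confidently.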

💡 NSP is a crucial component of BERT training, allowing the model to understand relationships between sentences, and its absence can significantly impact the model's performance.
