Sentence Similarity With Transformers and PyTorch (Python)

James Briggs · Beginner · 🧬 Deep Learning · 5y ago


Key Takeaways

This video demonstrates how to use BERT and PyTorch to calculate sentence similarity, using mean pooling to build sentence embeddings and cosine similarity to compare them. It loads the sentence-transformers bert-base-nli-mean-tokens model through AutoTokenizer and AutoModel and produces a 768-dimensional embedding for each sentence.

Full Transcript

Today we're going to have a look at how we can use transformers like BERT to create embeddings for sentences, and how we can then take those sentence vectors and use them to calculate the semantic similarity between different sentences.

At a high level, what you can see on the screen right now is a BERT base model. Inside BERT base we have multiple encoders, and at the bottom we can see we have our tokenized text. We have 512 tokens here, and they get passed into our first encoder to create these hidden state vectors, which are of size 768 in BERT. These get processed through multiple encoders, and between every one of these encoders (that's 12 in total) there is going to be a vector of size 768 for every single token that we have, so 512 tokens in this case. What we're going to do is take the final tensor out here, this last hidden state tensor, and use mean pooling to compress it into a 768 by 1 vector, and that is our sentence vector. Then once we've built our sentence vector, we're going to use cosine similarity to compare different sentences and see if we can get something that works.

So switching across to Python, these are the sentences we're going to be comparing, and there are two in particular. There's this one here, which is "Three years later, the coffin was still full of Jello", and that has the same meaning as this one here; I just rewrote it with completely different words. I don't think there are really any words here that match: instead of "years" we have "dozens of months", "jelly" for "Jello", and "person box" for "coffin". No normal human would even say that. Well, no normal human would probably say either of those, but we definitely wouldn't use "person box" for "coffin" and "many dozens of months" for "years". So it's reasonably complicated, but we'll see that this should work for similarity: we'll find that these two share the highest similarity score after we've encoded them with BERT and calculated our cosine similarity.

Down here is the model we'll be using: we're going to be using sentence-transformers and then the bert-base-nli-mean-tokens model. Now there are two approaches that we can take here: the easy approach, using something called sentence-transformers, which I'm going to be covering in another video, and this approach, which is a little more involved, where we're going to be using transformers and PyTorch.

The first thing we need to do is create our last hidden state tensor. Of course, we need to import the libraries that we're going to be using: from transformers we're going to be using the AutoTokenizer and the AutoModel, and then we need to import torch as well. After we've imported these, we first need to initialize our tokenizer and model. For the tokenizer we just do AutoTokenizer, and for both of these we're going to use from_pretrained with the model name that we've already defined. These are coming from the Hugging Face library, obviously, and we can see the model here. Then our model is AutoModel.from_pretrained again.

Now what we want to do is tokenize all of our sentences. To do this we're going to use a tokens dictionary, and in here we're going to have input_ids, which will contain a list (you'll see why in a moment), and attention_mask, which will also contain a list. We go through each sentence one by one: for sentence in sentences, we use the tokenizer's encode_plus method. In here we need to pass our sentence, and we need to pass the maximum length of our sequence. With BERT we would usually set this to 512, but because we're using this bert-base-nli-mean-tokens model, this should actually be set to 128. So we set max_length to 128, and anything longer than this we want to truncate, so we set truncation equal to True. Anything shorter than this, which they all will be in our case, we pad up to the max length by setting padding equal to 'max_length'. Then we want to say return_tensors, and we set this equal to 'pt' because we're using PyTorch.

This will return a dictionary containing input_ids and attention_mask for a single sentence, so we assign it to the new_tokens variable. Then we access input_ids in our tokens dictionary and append the input_ids for the single sentence from new_tokens, and we do the same for our attention_mask. There's another thing as well: these are wrapped in an extra dimension, almost like a list within a list but in tensor format, so we also want to extract the first element there.

That's good, but obviously we're using PyTorch here; we want PyTorch tensors, not lists. Within these lists we do have PyTorch tensors. In fact, let me just show you: if we have a look in here, we'll see that we have our PyTorch tensors, but they're contained within a normal Python list. We can even check that: if we do type, we get list, and inside there we have the torch tensor, which is what we want for all of them. To convert this list of PyTorch tensors into a single PyTorch tensor, we take torch and use the stack method. What the stack method does is take a list of PyTorch tensors and stack all of those on top of each other, essentially adding another dimension, hence the name. We want to do that for both input_ids and the attention_mask. Then let's have a look at what we have. Let's look at input_ids, and now we just have a single tensor; if we check type, we just have a tensor. Now let's check its size: we have six sentences that have all been encoded into 128 tokens, ready to go into our model.

To process these through our model, we'll assign the result to this outputs variable: we take our model and pass our tokens as keyword arguments into the model. We process that, and that gives us this output object. Inside this output object we have the last hidden state tensor here, and we can also see that if we print out the keys: we have last_hidden_state and we also have this pooler_output. Now we want to take our last hidden state tensor and perform the mean pooling operation to convert it into a sentence vector. To get that last hidden state, we assign it to this embeddings variable and extract it using last_hidden_state, like that. Let's just check what we have here by calling shape: we have the six sentences, we have the 128 tokens, and then we have the 768 dimension size, which is just the hidden state dimension within BERT.

So what we have at the moment is this last hidden state tensor, and what we're going to do now is convert it into sentence vectors using a mean pooling operation. The first thing we need to do is multiply every value within this last hidden state tensor by zero wherever we don't have a real token. If we look up here, we've padded all of these, and obviously there are more padding tokens in this sentence than there are in this sentence. So we need to take each of those attention mask tensors that we took here, which just contain ones and zeros (ones where there are real tokens, zeros where there are padding tokens), and multiply that out to remove any activations where there should just be padding tokens, i.e. zeros. The only problem is that if we have a look at the size of our attention mask, tokens attention_mask, we get 6 by 128. So what we need to do is add this other dimension, which is the 768, and then we can just multiply those two tensors together, and this will remove the embedding values where there shouldn't be embedding values.

To do that, we'll assign the result to mask, but we'll do that later, actually. So we take attention_mask, and what I want to do is use the unsqueeze method. If we look at the shape, we can see what is actually happening here: we've added this other dimension. What that allows us to do is expand that dimension out to 768, which will then match the shape that we need to multiply those two together. So we do expand, and here we pass the embeddings shape that we already used up here. That will compare these two and see that we need to expand this one dimension out to 768, and if we execute that we can see that it has worked. The final thing that we need to do there is convert that into a float tensor, and then we assign that to mask. This float at the end is just converting it from integer to float.

Now what we can do is apply this mask to our embeddings. We'll call this one masked_embeddings, and it is very simple: we just do embeddings multiplied by mask. Now if we compare, have a look at embeddings first; it's quite a lot. Then we have a look at masked_embeddings, and you see that we have the same values here, looking at the top these are the same, but then these values here have been mapped to zero because they are just padding tokens. We don't want to pay attention to those, so that's the point of the masking operation there: to remove those.

Now what we want to do is take all of those embeddings, because if we have a look at the shape, we still have these 128 tokens, and we want to condense them into one vector. There are two operations that we need to do here, because we're doing a mean pooling operation. We need to calculate the sum along each of these, so we're summing all of these up together and pushing them into a single value. Then we also need to count all of those values, but only where we were supposed to be paying attention: where we converted them into zeros, we don't want to count those values. Then we divide that sum by the count to get our mean.

To get summed, we do torch.sum and then just masked_embeddings, and this is in dimension 1, which is this dimension here. Let's have a look at the shape that we have: now we can see that we've removed this dimension. Next we want to create our counts, and to do this we use a slightly different approach. We do torch.clamp, and inside here we do mask.sum, again in dimension 1, and we also add a min argument here, which just stops us from creating any divide-by-zero error. All this needs to be is a very small number; I usually just use 1e-9, although in reality it shouldn't really make a difference. So that's our summed and our counts, and now we get the mean pooled: we do mean_pooled equals summed divided by counts. We'll just check the size of that again. Okay, so those are our sentence vectors: we have six of them here, and each one contains just 768 values. Let's have a look at what they look like; we just get these values here.

Now what we can do is compare each of these and see which ones get the highest cosine similarity value. We're going to be using the sklearn implementation: from sklearn.metrics.pairwise we import cosine_similarity. This would expect NumPy arrays, and obviously we have PyTorch tensors, so we are going to get an error. I'm going to show you so you at least see it and know how to fix it. So, cosine_similarity, and in here we first pass the single vector that we are going to be comparing. I'm going to compare the first sentence, so I take the very first one of those, which is mean_pooled[0]. Because we are extracting this out directly, we need to wrap it in a list so that it's a list within a list. Then we want to extract the remaining five sentences, so we go from 1 all the way to the end, and that's the last five there.

If we run this, we're going to get this RuntimeError. We go down and we see "Can't call numpy() on Tensor that requires grad". This is just PyTorch: this tensor is currently attached to our PyTorch model, and we need to detach it from PyTorch in order to convert it into NumPy. It actually tells us exactly what we need to do: "Use tensor.detach().numpy() instead". So we take detach and numpy, and all we need to do is write mean_pooled equals that. We run it and we get our similarity scores. Straight away we get 0.33, 0.17, 0.44 and 0.55, and this one is the highest similarity, 0.72, by a fair bit as well. That is comparing this sentence and the sentence at index 1 of our last five, which is this one. So there we've calculated similarity, and it is clearly working.

That's it for this video. I hope it's been useful, I think this is really cool, and I'll see you in the next one.
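The masking, pooling and detach steps described in the transcript can be sketched on toy tensors without loading a model. This is a minimal illustration, not the code shown in the video: the helper name mean_pool and the toy values are assumptions, and real model output for the six sentences would have shape (6, 128, 768) rather than (2, 3, 2).

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) -> (batch, seq_len, hidden), as floats
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    masked_embeddings = last_hidden_state * mask      # padding positions become zero
    summed = torch.sum(masked_embeddings, dim=1)      # (batch, hidden)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)   # real-token counts, never zero
    return summed / counts


# Toy stand-in for last_hidden_state: 2 sentences, 3 tokens each, hidden size 2.
# requires_grad=True mimics a tensor that is still attached to a model.
embeddings = torch.tensor(
    [
        [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third token is padding
        [[2.0, 2.0], [4.0, 6.0], [6.0, 1.0]],  # no padding
    ],
    requires_grad=True,
)
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

mean_pooled = mean_pool(embeddings, attention_mask)

# Calling mean_pooled.numpy() here would raise the RuntimeError from the
# transcript, because the tensor requires grad; detach() first.
sentence_vectors = mean_pooled.detach().numpy()
print(sentence_vectors)
```

The padded token [9.0, 9.0] in the first sentence does not affect its average, which is the point of multiplying by the mask before summing and of counting only real tokens in the denominator.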

Original Description

Easy mode: https://youtu.be/Ey81KfQ3PQU

All we ever seem to talk about nowadays are BERT this, BERT that. I want to talk about something else, but BERT is just too good - so this video will be about BERT for sentence similarity.

A big part of NLP relies on similarity in highly-dimensional spaces. Typically an NLP solution will take some text, process it to create a big vector/array representing said text - then perform several transformations. It's highly-dimensional magic.

Sentence similarity is one of the clearest examples of how powerful highly-dimensional magic can be. The logic is this:

- Take a sentence, convert it into a vector.
- Take many other sentences, and convert them into vectors.
- Find sentences that have the smallest distance (Euclidean) or smallest angle (cosine similarity) between them.
- We now have a measure of semantic similarity between sentences - easy!

At a high level, there's not much else to it. But of course, we want to understand what is happening in a little more detail and implement this in Python too.

Medium article: https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1
Free link to the article: https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1?sk=c0f2990b4660210b447e52d55bd0f4e5

00:00 Intro
00:16 BERT Base Network
1:11 Sentence Vectors and Similarity
1:47 The Data and Model
3:01 Two Approaches
3:16 Tokenizing Sentences
9:11 Creating last_hidden_state Tensor
11:08 Creating Sentence Vectors
17:53 Cosine Similarity
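The difference between the two measures named in the description, smallest distance (Euclidean) and smallest angle (cosine similarity), can be seen on small hand-made vectors. This is a rough NumPy sketch; the helper names euclidean_distance and cosine_sim and the example vectors are illustrative assumptions and do not come from the video.

```python
import numpy as np


def euclidean_distance(a, b) -> float:
    """Straight-line distance between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def cosine_sim(a, b) -> float:
    """Cosine of the angle between two vectors (1 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = np.array([1.0, 2.0, 3.0])
scaled = 2.0 * query                      # same direction, twice the length
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with query is zero

# Same direction: cosine similarity is about 1 even though the points are far apart.
print(euclidean_distance(query, scaled), cosine_sim(query, scaled))
# Perpendicular vectors: cosine similarity is 0.
print(euclidean_distance(query, orthogonal), cosine_sim(query, orthogonal))
```

Cosine similarity ignores vector length and looks only at direction, which is one reason it is a common choice for comparing sentence embeddings.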

This video teaches how to use BERT and PyTorch to calculate sentence similarity, covering mean pooling, cosine similarity, and sentence embeddings. By the end of the video, viewers should be able to compute sentence similarity scores themselves with a pre-trained BERT model and PyTorch. The video is particularly useful for those interested in natural language processing and machine learning.

Key Takeaways

Initialize the tokenizer and the model with from_pretrained
Tokenize each sentence with the encode_plus method
Truncate sentences to 128 tokens and pad shorter ones to that length
Return PyTorch tensors rather than Python lists
Stack the per-sentence tensors into a single tensor with torch.stack
Use the unsqueeze method to add a dimension to the attention mask, then expand it to the embeddings shape
Zero out padding token embeddings by multiplying the embeddings by the attention mask
Apply mean pooling to get one sentence vector per sentence
Calculate the mean by summing the masked embeddings and dividing by the count of real tokens
Compare sentence vectors using cosine similarity
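The tokenize-and-stack steps at the top of this list can be sketched as a small helper. This is a hedged sketch rather than the exact code from the video: the function name tokenize_sentences is hypothetical, and it calls the tokenizer object directly instead of using encode_plus, on the assumption that the direct call accepts the same max_length, truncation, padding and return_tensors arguments.

```python
import torch


def tokenize_sentences(sentences, tokenizer, max_length=128):
    """Tokenize sentences one by one and stack them into (N, max_length) tensors.

    `tokenizer` is assumed to be a Hugging Face tokenizer (or any callable
    that returns 'input_ids' and 'attention_mask' tensors of shape (1, max_length)).
    """
    tokens = {"input_ids": [], "attention_mask": []}
    for sentence in sentences:
        new_tokens = tokenizer(
            sentence,
            max_length=max_length,   # 128 for the model discussed in the video
            truncation=True,         # cut anything longer than max_length
            padding="max_length",    # pad anything shorter up to max_length
            return_tensors="pt",     # PyTorch tensors
        )
        # Each tensor has shape (1, max_length); [0] drops the leading dimension.
        tokens["input_ids"].append(new_tokens["input_ids"][0])
        tokens["attention_mask"].append(new_tokens["attention_mask"][0])
    # torch.stack joins the list of (max_length,) tensors along a new first dimension.
    return {key: torch.stack(value) for key, value in tokens.items()}
```

With the transformers library installed, a real tokenizer could be created with AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens') (the model name is inferred from the transcript), and the returned dictionary passed to the model as model(**tokens) to obtain last_hidden_state.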

💡 Mean pooling turns the per-token BERT outputs into one fixed-size vector per sentence, and cosine similarity then scores how closely two of those vectors point in the same direction, which makes the pair a practical way to compare sentences by meaning.
