Sentence Similarity With Transformers and PyTorch (Python)
Key Takeaways
This video demonstrates how to use BERT and PyTorch to calculate sentence similarity, using mean pooling to build sentence embeddings and cosine similarity to compare them. It loads the bert-base-nli-mean-tokens model from sentence-transformers with AutoTokenizer and AutoModel from the Transformers library and produces one 768-dimensional embedding per sentence.
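The masked mean pooling named in the summary can be sketched on a toy tensor. This is only an illustration, assuming PyTorch is installed: the values, the hidden size of 2 (instead of 768), and the names `hidden`, `attention_mask`, `pooled` and `naive` are made up for the example and are not taken from the video's notebook.

```python
import torch

# Toy "last hidden state": 1 sentence, 3 token positions, hidden size 2
# (the model in the video uses 128 positions and hidden size 768).
hidden = torch.tensor([[[1.0, 2.0],
                        [3.0, 4.0],
                        [9.0, 9.0]]])  # third position stands in for padding

# Attention mask: 1 for real tokens, 0 for padding.
attention_mask = torch.tensor([[1, 1, 0]])

# (1, 3) -> (1, 3, 1) -> (1, 3, 2), so the mask matches the hidden states.
mask = attention_mask.unsqueeze(-1).expand(hidden.size()).float()

# Zero out the padding positions, then sum over the token dimension.
summed = torch.sum(hidden * mask, dim=1)

# Number of real tokens per hidden dimension; clamp guards against
# dividing by zero if a row were entirely padding.
counts = torch.clamp(mask.sum(dim=1), min=1e-9)

pooled = summed / counts    # masked mean, shape (1, 2)
naive = hidden.mean(dim=1)  # unmasked mean, skewed by the padding row

print(pooled)  # tensor([[2., 3.]])
print(naive)
```

Comparing `pooled` with `naive` shows why the mask matters: the plain mean lets the padding row shift the result. Note that `torch.clamp` has no built-in minimum; the lower bound is whatever is passed as `min`.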
Full Transcript
Today we're going to have a look at how we can use transformers like BERT to create embeddings for sentences, and how we can then take those sentence vectors and use them to calculate the semantic similarity between different sentences.

So, at a high level, what you can see on the screen right now is a BERT base model. Inside BERT base we have multiple encoders, and at the bottom we can see we have our tokenized text. We have 512 tokens here, and they get passed into our first encoder to create these hidden state vectors, which are of size 768 in BERT. Now, these get processed through multiple encoders, and between every one of these encoders (that's 12 in total) there is going to be a vector of size 768 for every single token that we have, so 512 tokens in this case.

Now, what we're going to do is take the final tensor out here, so this last hidden state tensor, and we're going to use mean pooling to compress it into a 768 by 1 vector, and that is our sentence vector. Then, once we've built our sentence vector, we're going to use cosine similarity to compare different sentences and see if we can get something that works.

So, switching across to Python, these are the sentences we're going to be comparing, and there are two to look at in particular. There's this one here, which is "three years later, the coffin was still full of jello", and that has the same meaning as this one here; I just rewrote it with completely different words. I don't think there are really any words here that match: instead of "years" we have "dozens of months", then "jelly" and "jello", "coffin" and "person box". No normal human would even say that. Well, no normal human would probably say either of those, but we definitely wouldn't use "person box" for "coffin" and "many dozens of months" for "years". So it's reasonably complicated, but we'll see that this should work for similarity: we'll find that these two share the highest similarity score after we've encoded them with BERT and calculated our cosine similarity.

And down here is the model we'll be using. So we're
going to be using sentence-transformers and then the bert-base-nli-mean-tokens model.

Now, there are two approaches that we can take here: the easy approach, using something called sentence-transformers, which I'm going to be covering in another video, and this approach, which is a little more involved, where we're going to be using transformers and PyTorch.

So the first thing we need to do is actually create our last hidden state tensor. Of course, we need to import the libraries that we're going to be using: from transformers we're going to be using the AutoTokenizer and the AutoModel, and then we need to import torch as well. After we've imported these, we first need to initialize our tokenizer and model. For the tokenizer we just do AutoTokenizer, and for both of these we're going to use from_pretrained with the model name that we've already defined. These are coming from the Hugging Face library, obviously, and we can see the model here, so it's this one. Then our model is AutoModel.from_pretrained again.

Now what we want to do is tokenize all of our sentences. To do this we're going to use a tokens dictionary, and in here we're going to have input_ids, which will contain a list (you'll see why in a moment), and attention_mask, which will also contain a list. When we're going through each sentence we have to do this one by one, so for sentence in sentences we are going to be using the tokenizer's encode_plus method. So tokenizer.encode_plus, and in here we need to pass our sentence and the maximum length of our sequence. With BERT we would usually set this to 512, but because we're using this bert-base-nli-mean-tokens model it should actually be set to 128, so we set max_length to 128. Anything longer than this we want to truncate, so we set truncation equal to True, and anything shorter than this, which they all will be in our case, we pad up to the max length, so we set padding equal to 'max_length'. And then here we want
to say return_tensors, and we set this equal to 'pt' because we're using PyTorch. Now, this will return a dictionary containing input_ids and attention_mask for a single sentence, so we'll assign it to a new_tokens variable. Then what we're going to do is access our tokens dictionary, input_ids first, and append the input_ids for the single sentence from the new_tokens variable, and then we do the same for our attention_mask. Okay, so that gives us those. There's another thing as well: these are wrapped, so we also want to extract the first element there, because they're almost like lists within a list, but in tensor format, and we want to extract the inner list.

Now that's good, but obviously we're using PyTorch here, and we want PyTorch tensors, not lists. Within these lists we do have PyTorch tensors. In fact, let me just show you: if we have a look in here, we'll see that we have our PyTorch tensors, but they're contained within a normal Python list. We can even check that: if we do type, we get list, and inside there we have the torch tensor, which is what we want for all of them.

So, to convert this list of PyTorch tensors into a single PyTorch tensor, we take torch and use the stack method. What the stack method does is take a list, within which it expects PyTorch tensors, and stack all of those on top of each other, essentially adding another dimension, hence the name. We want to do that for both input_ids and the attention_mask. Then let's have a look at what we have: if we look at input_ids, we now just have a single tensor, and if we check the type, it's just a tensor. That's great. Check its size, and we have six sentences that have all been encoded into 128 tokens, ready to go into our model.

So, to process these through our model, we'll assign the outputs to this outputs
variable. We take our model and pass our tokens as keyword arguments into the model input there. We process that, and it gives us this output object. Inside this output object we have the last hidden state tensor here, and we can also see that if we print out the keys: we have last_hidden_state and we also have this pooler_output.

Now we want to take our last hidden state tensor and perform the mean pooling operation to convert it into a sentence vector. To get that last hidden state we assign it to this embeddings variable, and we extract it using last_hidden_state, like that. Let's just check what we have here, so we'll look at the shape, and you see now we have the six sentences, the 128 tokens, and then the 768 dimension size, which is just the hidden state dimension within BERT.

So what we have at the moment is this last hidden state tensor, and what we're going to do now is convert it into this using a mean pooling operation. The first thing we need to do is multiply every value within this last hidden state tensor by zero wherever we shouldn't have a real token. If we look up here, we've padded all of these, and obviously there are more padding tokens in this sentence than there are in this sentence. So we need to take each of those attention mask tensors that we took here, which just contain ones and zeros (ones where there are real tokens, zeros where there are padding tokens), and multiply that out to remove any activations where there should just be padding tokens, i.e. zeros.

Now, the only problem is that if we have a look at our attention mask, so tokens attention_mask, and look at the size, we get 6 by 128. So what we need to do is add this other dimension, which is the 768, and then we can just multiply those two tensors together, and this will remove the embedding values where there shouldn't be embedding values. To do that we'll assign it to mask, but we'll do that later actually. So, the attention mask,
and what I want to do is use the unsqueeze method. If we look at the shape, so we can see what is actually happening here, we see that we've added this other dimension. What that allows us to do is expand that dimension out to 768, which will then match the shape that we need to multiply those two together. So we do expand, and here we take embeddings, because we want to expand it out to the embeddings shape that we already have up here. That will compare these two and see that we need to expand this one dimension out to 768, and if we execute that we can see that it has worked. The final thing that we need to do there is convert that into a float tensor, and then we assign that to the mask here. So this float at the end is just converting it from integer to float.

So now what we can do is apply this mask to our embeddings. We'll call this one masked_embeddings, and it is very simple: we just do embeddings multiplied by mask. Now, if we compare, have a look at embeddings and what we have here, so it's quite a lot. Then we have a look at masked_embeddings, and you see here that we have the same values, so looking at the top these are the same, but then these values here have been mapped to zero because they are just padding tokens, and we don't want to pay attention to those. So that's the point of the masking operation there: to remove those.

Now what we want to do is take all of those embeddings, because if we have a look at the shape we still have these 128 tokens, and we want to convert them into a single vector. There are two operations that we need to do here, since we're doing a mean pooling operation. We need to calculate the sum within each of these, so we sum all of these up together and push them into a single value. Then we also need to count all of those values, but only where we were supposed to be paying attention, so where we converted them into zeros we don't want to count
those values. Then we divide that sum by the count to get our mean.

So, to get summed we do torch.sum and then just masked_embeddings, and this is in dimension one, which is this dimension here. Let's have a look at the shape that we have here. Okay, so now we can see that we've removed this dimension. Now what we want to do is create our counts, and to do this we use a slightly different approach: we do torch.clamp, and inside here we do mask.sum, again in dimension one. We also add a min argument here, which just stops us from creating any divide-by-zero error. So we do 1e-9, and all this needs to be is a very small number. I think by default it's 1e-8, but I usually just use 1e-9, although in reality it shouldn't really make a difference. And, sorry, just put counts there. Okay, so that's our summed and our counts, and now we get mean_pooled, so we do mean_pooled equals summed divided by counts, and we'll just check the size of that again. Okay, so that is our sentence vector. We have six of them here, each one contains just 768 values, and if we have a look at what they look like, we just get these values here.

Now what we can do is compare each of these and see which ones get the highest cosine similarity value. We're going to be using the scikit-learn implementation: from sklearn.metrics.pairwise we import cosine_similarity. This would expect NumPy arrays, and obviously we have PyTorch tensors, so we are going to get an error. I'm going to show you, so you at least see it and know how to fix it. So, cosine_similarity, and in here we want to pass the single vector that we are going to be comparing. I'm going to compare the first sentence, so if we just take these and put them down here, I'm going to take the very first one of those, which is mean_pooled[0]. Because we are extracting this out directly, we get it in something like a list format, and we want it to be in a vector
format, so it's a list within a list. Then we want to extract the remaining five sentences, so we go from one all the way to the end, and that's the last five there.

Now, if we run this we're going to get this runtime error. We go down and we see "Can't call numpy() on Tensor that requires grad". This is just PyTorch: this tensor is currently attached to our PyTorch model, and we need to detach it from PyTorch in order to convert it into something that PyTorch cannot read anymore. It actually tells us exactly what we need to do: "Use tensor.detach().numpy() instead". So we take detach and numpy, and all we need to do is write mean_pooled equals that. We run it and we get our similarity scores. Straight away we get 0.33, 0.17, 0.44, 0.55, and this one is the one with the highest similarity, 0.72, by a fair bit as well. That is comparing this sentence and the sentence at index one of our last five, which is this one. So there we've calculated similarity, and it is clearly working.

So that's it for this video. I hope it's been useful. I think this is really cool, and I'll see you in the next one.
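The steps walked through in the transcript can be pulled together roughly as follows. This is a sketch under stated assumptions, not the notebook from the video: the Hugging Face model ID is assumed from the model name mentioned in the transcript, the function names `embed_sentences` and `similarity_to_first` are hypothetical, the tokenizer is called directly rather than through `encode_plus`, and `torch.nn.functional.cosine_similarity` is swapped in for the scikit-learn function so that the sketch depends only on torch and transformers.

```python
import torch
import torch.nn.functional as F

# Assumption: Hugging Face ID of the model named in the video.
MODEL_NAME = "sentence-transformers/bert-base-nli-mean-tokens"


def embed_sentences(sentences, model_name=MODEL_NAME, max_length=128):
    """Return one mean-pooled sentence vector per input sentence."""
    # Imported here so similarity_to_first can be used without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    input_ids, attention_masks = [], []
    for sentence in sentences:
        # Tokenize one sentence at a time, padded or truncated to max_length.
        new_tokens = tokenizer(
            sentence,
            max_length=max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        # Each tensor has shape (1, max_length); keep the inner row.
        input_ids.append(new_tokens["input_ids"][0])
        attention_masks.append(new_tokens["attention_mask"][0])

    # Stack the per-sentence rows into (num_sentences, max_length) tensors.
    tokens = {
        "input_ids": torch.stack(input_ids),
        "attention_mask": torch.stack(attention_masks),
    }

    # no_grad is an addition to the video's steps: the outputs then carry
    # no gradient history, so no detach() is needed afterwards.
    with torch.no_grad():
        embeddings = model(**tokens).last_hidden_state

    # Masked mean pooling over the token dimension.
    mask = tokens["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
    summed = torch.sum(embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts


def similarity_to_first(vectors):
    """Cosine similarity between the first vector and each remaining one."""
    others = vectors[1:]
    query = vectors[0].unsqueeze(0).expand_as(others)
    return F.cosine_similarity(query, others, dim=1)
```

With `sentences` as a list of strings, `similarity_to_first(embed_sentences(sentences))` would return one score per remaining sentence; the first call downloads the model weights. To stay with scikit-learn as in the video, the pooled tensor can be converted with `.detach().numpy()` before it is passed to `cosine_similarity`.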
Original Description
Easy mode: https://youtu.be/Ey81KfQ3PQU
All we ever seem to talk about nowadays are BERT this, BERT that. I want to talk about something else, but BERT is just too good - so this video will be about BERT for sentence similarity.
A big part of NLP relies on similarity in highly-dimensional spaces. Typically an NLP solution will take some text, process it to create a big vector/array representing said text - then perform several transformations.
It's highly-dimensional magic.
Sentence similarity is one of the clearest examples of how powerful highly-dimensional magic can be.
The logic is this:
- Take a sentence, convert it into a vector.
- Take many other sentences, and convert them into vectors.
- Find sentences that have the smallest distance (Euclidean) or smallest angle (cosine similarity) between them - more on that here.
- We now have a measure of semantic similarity between sentences - easy!
At a high level, there's not much else to it. But of course, we want to understand what is happening in a little more detail and implement this in Python too.
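The two measures named in the list above can be written out in plain Python. This is a minimal sketch for illustration: the three-dimensional vectors are made up (real sentence vectors here would have 768 values), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up "sentence vectors" for illustration only.
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]  # same direction, twice the length
different = [3.0, -1.0, 0.5]

print(cosine_similarity(query, same_direction))   # close to 1.0
print(euclidean_distance(query, same_direction))  # non-zero despite same direction
print(cosine_similarity(query, different))
```

The first two prints show the difference between the measures: cosine similarity ignores vector length and looks only at direction, while Euclidean distance grows when one vector is simply a scaled copy of the other.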
🤖 70% Discount on the NLP With Transformers in Python course:
https://bit.ly/3DFvvY5
Medium article:
https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1
🎉 Sign-up For New Articles Every Week on Medium!
https://medium.com/@jamescalam/membership
📖 If membership is too expensive - here's a free link:
https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1?sk=c0f2990b4660210b447e52d55bd0f4e5
👾 Discord
https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery:
https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff
0:00 Intro
0:16 BERT Base Network
1:11 Sentence Vectors and Similarity
1:47 The Data and Model
3:01 Two Approaches
3:16 Tokenizing Sentences
9:11 Creating last_hidden_state Tensor
11:08 Creating Sentence Vectors
17:53 Cosine Similarity