Sentence Similarity With Transformers and PyTorch (Python)

James Briggs · Beginner · 🧬 Deep Learning · 5y ago

Key Takeaways

This video demonstrates how to use BERT and PyTorch to calculate sentence similarity, using mean pooling to turn token embeddings into sentence embeddings and cosine similarity to compare them. It loads the pre-trained sentence-transformers bert-base-nli-mean-tokens model with AutoTokenizer and AutoModel from the Transformers library and creates a 768-dimensional embedding for each sentence.

Full Transcript

Today we're going to have a look at how we can use transformers like BERT to create embeddings for sentences, and how we can then take those sentence vectors and use them to calculate the semantic similarity between different sentences.

So at a high level, what you can see on the screen right now is a BERT base model. Inside BERT base we have multiple encoders, and at the bottom we can see we have our tokenized text. We have 512 tokens here, and they get passed into our first encoder to create these hidden state vectors, which are of size 768 in BERT. Now these get processed through multiple encoders, and between every one of these encoders, that's 12 in total, there is going to be a vector of size 768 for every single token that we have, so 512 tokens in this case.

Now what we're going to do is take the final tensor out here, so this last hidden state tensor, and we're going to use mean pooling to compress it into a 768 by 1 vector, and that is our sentence vector. Then once we've built our sentence vector, we're going to use cosine similarity to compare different sentences and see if we can get something that works.

So switching across to Python, these are the sentences we're going to be comparing, and there are two to point out. There's this one here, which is "three years later, the coffin was still full of jello", and that has the same meaning as this one here; I just rewrote it but with completely different words. So I don't think there are really any words here that match: instead of years we have dozens of months, jelly for jello, person box for coffin. All right, no normal human would even say that sentence. Well, no normal human would probably say either of those, but we definitely wouldn't use person box for coffin and many dozens of months for years. So it's reasonably complicated, but we'll see that this should work for similarity: we'll find that these two share the highest similarity score after we've encoded them with BERT and calculated our cosine similarity.

And down here is the model we'll be using, so we're going to be using sentence-transformers and then the bert-base-nli-mean-tokens model. Now there are two approaches that we can take here: the easy approach using something called sentence-transformers, which I'm going to be covering in another video, and this approach, which is a little more involved, where we're going to be using transformers and PyTorch.

So the first thing we need to do is actually create our last hidden state tensor. Of course we need to import the libraries that we're going to be using, so from transformers we're going to be using the AutoTokenizer and the AutoModel, and then we need to import torch as well. After we've imported these, we need to first initialize our tokenizer and model, which we just do with AutoTokenizer, and for both of these we're going to use from_pretrained with the model name that we've already defined. These are coming from the Hugging Face library, obviously, and we can see the model here, so it's this one. And then our model is AutoModel.from_pretrained again.

Now what we want to do is tokenize all of our sentences. To do this we're going to use a tokens dictionary, and in here we're going to have input_ids, which will contain a list, and you'll see why in a moment, and attention_mask, which will also contain a list. When we're going through each sentence we have to do this one by one, so for sentence in sentences we are going to be using the tokenizer's encode_plus method. So tokenizer.encode_plus, and in here we need to pass our sentence and the maximum length of our sequence. With BERT we would usually set this to 512, but because we're using this bert-base-nli-mean-tokens model, this should actually be set to 128, so we set max_length to 128. Anything longer than this we want to truncate, so we set truncation equal to True, and anything shorter than this, which they all will be in our case, we pad up to that max length by setting padding equal to 'max_length'. Then here we want to say return_tensors, and we set this equal to 'pt' because we're using PyTorch.

Now this will return a dictionary containing input_ids and attention_mask for a single sentence. So we'll assign it to a new_tokens variable, and then what we're going to do is access our tokens dictionary, input_ids first, and append the input_ids for the single sentence from the new_tokens variable, and then we do the same for our attention_mask. Okay, so that gives us those. There's another thing as well: these are wrapped as vectors, so we also want to just extract the first element there, because they're almost like lists within a list but in tensor format, and we want to extract the inner list.

Now that's good, but obviously we're using PyTorch here and we want PyTorch tensors, not lists. Within these lists we do have PyTorch tensors, so in fact let me just show you. If we have a look in here, we'll see that we have our PyTorch tensors, but they're contained within a normal Python list. We can even check that: if we do type, we see we get a list, and inside there we have the torch tensor, which is what we want for all of them. So to convert this list of PyTorch tensors into a single PyTorch tensor, we take torch and use the stack method. What the stack method does is take a list of PyTorch tensors and stack all of those on top of each other, essentially adding another dimension, hence the name. So we take that and do it for both input_ids and the attention_mask, and then let's have a look at what we have. So let's go input_ids, and now we just have a single tensor. Okay, so we check the type, and now we just have a tensor. That's great. Check its size: we have six sentences that have all been encoded into the 128 tokens, ready to go into our model.

So to process these through our model, we'll assign the outputs to this outputs variable. We take our model and pass our tokens as keyword arguments into the model input there. We process that, and that will give us this output object, and inside this output object we have the last hidden state tensor here. We can also see that if we print out the keys: you see that we have last_hidden_state and we also have this pooler_output. Now we want to take our last hidden state tensor and then perform the mean pooling operation to convert it into a sentence vector. To get that last hidden state, we will assign it to this embeddings variable, and we extract it using last_hidden_state, like that. Let's just check what we have here, so we'll just call shape, and you see now we have the six sentences, we have the 128 tokens, and then we have the 768 dimension size, which is just the hidden state dimension within BERT.

So what we have at the moment is this last hidden state tensor, and what we're going to do now is convert it into this using a mean pooling operation. The first thing we need to do is multiply every value within this last hidden state tensor by zero where we shouldn't have a real token. If we look up here, we've padded all of these, and obviously there are more padding tokens in this sentence than there are in this sentence. So we need to take each of those attention mask tensors that we took here, which just contain ones and zeros, ones where there are real tokens and zeros where there are padding tokens, and multiply that out to remove any activations where there should just be padding tokens, i.e. zeros.

Now the only problem is that if we have a look at our attention mask, so tokens attention_mask, and look at the size, we get 6 by 128. So what we need to do is add this other dimension, which is the 768, and then we can just multiply those two tensors together, and this will remove the embedding values where there shouldn't be embedding values. To do that, we'll assign it to mask, but we'll do that later actually. So we take attention_mask, and what I want to do is use the unsqueeze method. If we look at the shape, so we can see what is actually happening here, we see that we've added this other dimension, and what that allows us to do is expand that dimension out to 768, which will then match the shape that we need to multiply those two together. So we do expand, and here we take embeddings, and we want to expand it out to the embeddings shape that we have already used up here. That will compare these two and see that we need to expand this one dimension out to 768, and if we execute that, we can see that it has worked. The final thing that we need to do there is convert that into a float tensor, and then we assign that to the mask here. So this float at the end is just converting it from integer to float.

So now what we can do is apply this mask to our embeddings. We'll call this one masked_embeddings, and it is very simple: we just do embeddings multiplied by mask. Now if we just compare, have a look at embeddings and what we have here, so it's quite a lot, and now we have a look at masked_embeddings, and you see here that we have the same values. Looking at the top, these are the same, but then these values here have been mapped to zero because they are just padding tokens, and we don't want to pay attention to those. So that's the point of the masking operation there, to remove those.

Now what we want to do is take all of those embeddings, because if we have a look at the shape that we have, we still have these 128 tokens, and we want to convert this into a single vector. There are two operations that we need to do here. We're doing a mean pooling operation, so we need to calculate the sum within each of these, so summing all these up together and pushing them into a single value, and then we also need to count all of those values, but only where we were supposed to be paying attention. Where we converted them into zeros, we don't want to count those values. Then we divide that sum by the count to get our mean.

So to get summed, we do torch.sum and then just masked_embeddings, and this is in dimension one, which is this dimension here. Let's have a look at the shape that we have here. Okay, so now we can see that we've removed this dimension. Now what we want to do is create our counts, and to do this we use a slightly different approach. We just do torch.clamp, and inside here we do mask.sum, again in dimension one, and then we also add a min argument here, which just stops us from creating any divide by zero error. So we do 1e, and all this needs to be is a very small number. I think by default it's 1e-8, but I usually just use 1e-9, although in reality it shouldn't really make a difference. And sorry, just put counts there. Okay, so that's our sum and our counts, and now we get the mean pooled, so we do mean_pooled equals summed divided by counts, and we'll just check the size of that again. Okay, so that is our sentence vector. We have six of them here, each one contains just 768 values, and let's have a look at what they look like. We just get these values here.

Now what we can do is compare each of these and see which ones get the highest cosine similarity value. We're going to be using the sklearn implementation, so from sklearn.metrics.pairwise we import cosine_similarity. This would expect NumPy arrays, and obviously we have PyTorch tensors, so we are going to get an error. I'm going to show you, so you at least see it and know how to fix it. So cosine_similarity, and in here we want to pass a single vector that we are going to be comparing. I'm going to compare the first text sentence, so if we just take these and put them down here, I'm going to take the very first one of those, which is mean_pooled at index 0. Because we are extracting this out directly, we get it in something like a list format, and we want it to be in a vector format, so we wrap it as a list within a list. Then we want to extract the remaining five sentences, so we go from index 1 all the way to the end. So that's the last five there.

Now if we run this, we're going to get this runtime error. We go down and we see "Can't call numpy() on Tensor that requires grad". So this is just with PyTorch: this tensor is currently within our PyTorch model, and we need to detach it from PyTorch in order to convert it into something that PyTorch no longer tracks. And it actually tells us exactly what we need to do: "Use tensor.detach().numpy() instead". So we take detach and numpy, and all we need to do is write mean_pooled equals that. We run it and we get our similarity scores. So straight away we get 0.33, 0.17, 0.44, 0.55, and this one is the one with the highest similarity, 0.72, by a fair bit as well. That is comparing this sentence and the sentence at index one of our last five, which is this one. So there we've calculated similarity, and it is clearly working.

So that's it for this video. I hope it's been useful. I think this is really cool, and I'll see you in the next one.
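The masking and mean pooling steps described in the transcript can be tried on a tiny synthetic tensor, which makes it easier to see what each operation does. This is a minimal sketch, not the video's notebook: the function name mean_pool and the toy values are assumptions for illustration, with 2 "sentences", 4 token slots and a hidden size of 3 standing in for 6, 128 and 768. Note that torch.clamp has no default min, so the small value (1e-9 here, as in the video) is always the caller's choice.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real (non-padding) positions only.

    `mean_pool` is a hypothetical helper name, not part of any library.
    """
    # (batch, tokens) -> (batch, tokens, 1) -> (batch, tokens, hidden), as floats
    mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    # Padding positions are multiplied by zero, real positions by one
    masked_embeddings = last_hidden_state * mask
    # Sum over the token dimension: (batch, hidden)
    summed = torch.sum(masked_embeddings, dim=1)
    # Number of real tokens per sentence; clamp guards against division by zero
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts


# Toy stand-in for last_hidden_state; the 9.0 rows sit at padding positions
embeddings = torch.tensor([
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]],
    [[2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [6.0, 6.0, 6.0], [9.0, 9.0, 9.0]],
])
attention_mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 0]])

mean_pooled = mean_pool(embeddings, attention_mask)
print(mean_pooled.shape)  # torch.Size([2, 3])
```

The first sentence averages only its two real tokens and the second only its three, so the 9.0 values at padding positions never reach the sentence vectors.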

Original Description

Easy mode: https://youtu.be/Ey81KfQ3PQU

All we ever seem to talk about nowadays are BERT this, BERT that. I want to talk about something else, but BERT is just too good - so this video will be about BERT for sentence similarity.

A big part of NLP relies on similarity in highly-dimensional spaces. Typically an NLP solution will take some text, process it to create a big vector/array representing said text - then perform several transformations. It's highly-dimensional magic.

Sentence similarity is one of the clearest examples of how powerful highly-dimensional magic can be. The logic is this:
- Take a sentence, convert it into a vector.
- Take many other sentences, and convert them into vectors.
- Find sentences that have the smallest distance (Euclidean) or smallest angle (cosine similarity) between them - more on that here.
- We now have a measure of semantic similarity between sentences - easy!

At a high level, there's not much else to it. But of course, we want to understand what is happening in a little more detail and implement this in Python too.

🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
Medium article: https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1
🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership
📖 If membership is too expensive - here's a free link: https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1?sk=c0f2990b4660210b447e52d55bd0f4e5
👾 Discord https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

00:00 Intro
00:16 BERT Base Network
1:11 Sentence Vectors and Similarity
1:47 The Data and Model
3:01 Two Approaches
3:16 Tokenizing Sentences
9:11 Creating last_hidden_state Tensor
11:08 Creating Sentence Vectors
17:53 Cosine Similarity
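The description's logic mentions two ways to compare vectors: smallest Euclidean distance or smallest angle (cosine similarity). As a rough illustration of how they differ, here is a small NumPy sketch. The helper names and the three toy vectors are assumptions for illustration and do not come from the video.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between a and b: 0.0 means identical vectors."""
    return float(np.linalg.norm(a - b))


# Hypothetical 3-dimensional "sentence vectors"
query = np.array([1.0, 2.0, 3.0])
same_direction = np.array([2.0, 4.0, 6.0])   # query scaled by 2
different = np.array([3.0, -1.0, 0.5])       # points somewhere else

for name, vector in [("same_direction", same_direction), ("different", different)]:
    print(
        name,
        "cosine:", round(cosine_similarity(query, vector), 3),
        "euclidean:", round(euclidean_distance(query, vector), 3),
    )
```

Cosine similarity only looks at direction, so the scaled copy of the query counts as a perfect match even though it is not at zero Euclidean distance. That is one reason cosine similarity is a common choice for comparing embeddings whose magnitudes vary.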

Playlist

Uploads from James Briggs · James Briggs · 33 of 60

This video teaches how to use BERT and PyTorch to calculate sentence similarity, covering mean pooling, cosine similarity, and sentence embeddings. By the end of the video, viewers will be able to compute sentence similarity scores themselves with a pre-trained BERT model and PyTorch. The video is particularly useful for those interested in natural language processing and machine learning.

Key Takeaways
  1. Initialize the tokenizer and the model with from_pretrained
  2. Tokenize each sentence using the encode_plus method
  3. Truncate sentences to 128 tokens and pad shorter ones to that max length
  4. Return PyTorch tensors rather than Python lists
  5. Stack the per-sentence tensors into a single tensor with torch.stack
  6. Use the unsqueeze method to add a dimension to the attention mask, then expand it to the embeddings' shape
  7. Zero out padding-token embeddings by multiplying the embeddings by the attention mask
  8. Apply mean pooling to get one vector per sentence
  9. Compute the mean by summing the masked embeddings and dividing by the count of real tokens
  10. Compare sentence vectors using cosine similarity
💡 Mean pooling with the attention mask keeps padding out of the sentence vector, and cosine similarity then compares sentence vectors by angle. In the video's example, the reworded sentence scores highest, at 0.72.
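The steps above can be strung together roughly as follows. This is a sketch under stated assumptions, not the video's exact notebook: the function names (load, tokenize, embed, similarity_to_first) are hypothetical, the model id is assumed to be the Hugging Face id of the bert-base-nli-mean-tokens model named in the video, torch.no_grad() is swapped in for the later detach() call, and the cosine similarity is computed in PyTorch instead of with sklearn. encode_plus is used because the video uses it; the Transformers documentation marks it as deprecated in favour of calling the tokenizer directly.

```python
import torch

# Assumed Hugging Face id for the model named in the video
MODEL_NAME = "sentence-transformers/bert-base-nli-mean-tokens"


def load(model_name: str = MODEL_NAME):
    """Step 1: load tokenizer and model (downloads weights on first use)."""
    from transformers import AutoModel, AutoTokenizer  # imported only when needed
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return tokenizer, model


def tokenize(tokenizer, sentences, max_length: int = 128):
    """Steps 2-5: encode sentences one by one, then stack into batch tensors."""
    tokens = {"input_ids": [], "attention_mask": []}
    for sentence in sentences:
        new_tokens = tokenizer.encode_plus(
            sentence,
            max_length=max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        # Each value has shape (1, max_length); keep the inner (max_length,) tensor
        tokens["input_ids"].append(new_tokens["input_ids"][0])
        tokens["attention_mask"].append(new_tokens["attention_mask"][0])
    return {key: torch.stack(value) for key, value in tokens.items()}


def embed(model, tokens):
    """Steps 6-9: run the model, mask padding, mean-pool to one vector per sentence."""
    with torch.no_grad():  # no gradients needed for inference (assumption, see lead-in)
        outputs = model(**tokens)
    embeddings = outputs.last_hidden_state  # (batch, tokens, hidden)
    mask = tokens["attention_mask"].unsqueeze(-1).expand(embeddings.size()).float()
    summed = torch.sum(embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts


def similarity_to_first(mean_pooled: torch.Tensor) -> torch.Tensor:
    """Step 10: cosine similarity of sentence 0 against every other sentence."""
    normed = torch.nn.functional.normalize(mean_pooled, dim=1)
    return normed[1:] @ normed[0]
```

A typical call would be tokenizer, model = load(), then scores = similarity_to_first(embed(model, tokenize(tokenizer, sentences))), where sentences is a list of strings and scores holds one value per sentence after the first.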

Chapters (9)

0:00 Intro
0:16 BERT Base Network
1:11 Sentence Vectors and Similarity
1:47 The Data and Model
3:01 Two Approaches
3:16 Tokenizing Sentences
9:11 Creating last_hidden_state Tensor
11:08 Creating Sentence Vectors
17:53 Cosine Similarity