Sentence Similarity With Sentence-Transformers in Python

James Briggs · Beginner · 🔍 RAG & Vector Search · 5y ago

Key Takeaways

This video demonstrates how to use the sentence-transformers library in Python to compare the similarity of different sentences, using a BERT-based model (bert-base-nli-mean-tokens) to create sentence embeddings and scikit-learn (sklearn) to calculate cosine similarity.

Full Transcript

Welcome to this video on using the sentence-transformers library to compare similarity between different sentences. This is generally a pretty short video. I'm not going to go really into depth; I'm just going to show you how to actually use the library. If you do want to go into a little more depth, I have another video that I'll be releasing just before this one. That will go into what is actually happening here: how the model we'll be using creates those embeddings, and then how we calculate the similarity. So if you're interested in that, go check it out. Otherwise, if you just want to get a quick similarity score between two sentences, this is probably the way to go.

We have these six sentences up here. This one, "Three years later, the coffin was still full of jello," and this one, "The person box was packed with jelly many dozens of months later," are saying the same thing, but the second one is saying it in a way that most of us wouldn't normally say it. Instead of "coffin" we're saying "person box"; instead of "jello" we're saying "jelly" (I think that's kind of normal, actually); and instead of "years" we're saying "dozens of months". So they're not really sharing the same words, but we're going to see that we can actually find that these two sentences are the most similar out of all of these.

So we're taking those, and we're going to import the sentence-transformers library. We want to import SentenceTransformer, and then from that we want to initialize a sentence transformer model. So we write SentenceTransformer, and in here we're going to use this model that I've already defined a model name for, which is the bert-base-nli-mean-tokens model. Initialize that (I need to rerun that), and we have our model. I'll just show you really quickly: this model is coming from the Hugging Face transformers library behind sentence-transformers, so this is the actual model we are using.

The first thing we do here is create our sentence vectors, or sentence embeddings. We'll write sentence_vecs = model.encode, and all we need to do here is pass our sentences. We can pass a single sentence or a list of sentences; it's completely fine. Then let's just have a quick look at what we have here. You see that we have this big array, and if we look at the shape, we see that we have a 6 by 768 array. The 6 refers to our six sentences, and the 768 refers to the hidden state size within the BERT model that we're using. So each one of these sentences is now being represented by a dense vector containing 768 values, and that means we can already take those and compare similarity between them.

To do that, we're going to use the sklearn implementation of cosine similarity, which we can import like this: from sklearn.metrics.pairwise we import cosine_similarity. To calculate our cosine similarity, all we do is take that function, and inside here we pass our first sentence, "Three years later, the coffin was still full of jello." I want to pass that sentence vector, which is just at index zero of our sentence vector array. Because we are extracting that single array value (if we have a look at this, you see that we have almost a list of lists here; if we just extract this, we only get a list), what we want to do is keep that inside a list, otherwise we'll get a dimension error. Then we pass sentence_vecs[1:], so this will be the remaining sentences.

Okay, so let's bring them down here and calculate this. We can see that our highest similarity, by quite a bit, is 0.72. That means that between this sentence and this sentence we have a similarity score of 0.72. So clearly it's working: the pair we expected is scoring the highest similarity. You can play around with this, test multiple different words and sentences, and just see how it works. But that's the easy way, putting all this together.

I think it's really cool that we can do that so easily, but I don't think there's really anything else to say about it. So thank you for watching, and I'll see you in the next one.
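The steps described in the transcript can be sketched roughly as follows. This is a minimal sketch, not the exact notebook from the video: the function name `rank_against_first` and the `SENTENCES` list are made up for illustration, and only the first two sentences are quoted from the video (the other four are never read out, so the remaining entries are placeholders). The calls `SentenceTransformer(...)`, `model.encode(...)` and `cosine_similarity(...)` are the ones named in the transcript.

```python
from typing import List, Sequence, Tuple

# The first two sentences are the pair quoted in the video; the others are
# hypothetical placeholders, since the transcript does not list the rest.
SENTENCES = [
    "Three years later, the coffin was still full of jello.",
    "The person box was packed with jelly many dozens of months later.",
    "The fish dreamed of escaping the fishbowl.",  # placeholder
    "She bought a new bicycle yesterday.",         # placeholder
]


def rank_against_first(
    sentences: Sequence[str],
    model_name: str = "bert-base-nli-mean-tokens",
) -> List[Tuple[str, float]]:
    """Score every sentence after the first against the first, best match first."""
    # Imported inside the function so the heavy dependencies (and the model
    # download) are only needed when the function is actually called.
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer(model_name)
    sentence_vecs = model.encode(list(sentences))  # one row per sentence

    # sentence_vecs[0] is a single 1-D vector, so wrap it in a list to keep
    # the 2-D shape that cosine_similarity expects. The result has one row.
    scores = cosine_similarity([sentence_vecs[0]], sentence_vecs[1:])[0]

    ranked = sorted(zip(sentences[1:], scores), key=lambda pair: pair[1], reverse=True)
    return [(sentence, float(score)) for sentence, score in ranked]
```

Calling `rank_against_first(SENTENCES)` downloads the model on first use. In the video, the "person box" sentence scored about 0.72 against the first sentence; the scores for the placeholder sentences here are not known in advance.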

Original Description

🎁 Free NLP for Semantic Search Course: https://www.pinecone.io/learn/nlp
Hard mode: https://youtu.be/jVPd7lEvjtg

All we ever seem to talk about nowadays are BERT this, BERT that. I want to talk about something else, but BERT is just too good - so this video will be about BERT for sentence similarity.

A big part of NLP relies on similarity in highly-dimensional spaces. Typically an NLP solution will take some text, process it to create a big vector/array representing said text - then perform several transformations. It's highly-dimensional magic.

Sentence similarity is one of the clearest examples of how powerful highly-dimensional magic can be. The logic is this:

- Take a sentence, convert it into a vector.
- Take many other sentences, and convert them into vectors.
- Find sentences that have the smallest distance (Euclidean) or smallest angle (cosine similarity) between them.
- We now have a measure of semantic similarity between sentences - easy!

At a high level, there's not much else to it. But of course, we want to understand what is happening in a little more detail and implement this in Python too.

🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
Medium article: https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1
🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership
📖 If membership is too expensive - here's a free link: https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1?sk=c0f2990b4660210b447e52d55bd0f4e5
👾 Discord https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff
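The description mentions two ways to compare vectors: smallest distance (Euclidean) and smallest angle (cosine similarity). The sketch below illustrates how the two can disagree. It uses plain Python and made-up 2-D toy vectors rather than real 768-dimensional sentence embeddings, and the helper names are chosen here for illustration only.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (closer to 1 means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-D vectors (made up for illustration, not real sentence embeddings).
a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the length
c = [2.0, 1.0]  # same length as a, different direction

print("a vs b:", euclidean_distance(a, b), cosine_similarity(a, b))
print("a vs c:", euclidean_distance(a, c), cosine_similarity(a, c))
```

Here `b` points in the same direction as `a`, so its cosine similarity with `a` is 1 (up to floating-point rounding), even though `c` is the closer of the two by Euclidean distance. Cosine similarity ignores vector length and looks only at direction.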


This video teaches how to use the sentence-transformers library to compare sentence similarity using BERT and cosine similarity. It provides a step-by-step guide on how to create sentence embeddings and calculate similarity scores.

Key Takeaways
  1. Import the sentence-transformers library
  2. Initialize a sentence transformer model
  3. Create sentence vectors or sentence embeddings
  4. Calculate cosine similarity between sentences using sklearn
  5. Compare similarity scores between different sentences
💡 The sentence-transformers library provides an easy-to-use interface for creating sentence embeddings and calculating similarity scores, making it a useful tool for natural language processing tasks.
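The video compares only the first sentence against the rest. As a small extension of the last step, the sketch below searches every pair of vectors for the most similar one. It is written in plain Python so it runs without any model download; `most_similar_pair` is a hypothetical helper, and `toy_vecs` holds made-up values standing in for the rows that `model.encode` would return.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_pair(vectors: Sequence[Sequence[float]]) -> Tuple[int, int, float]:
    """Return (i, j, score) for the two rows with the highest cosine similarity."""
    scored_pairs = (
        (i, j, _cosine(vectors[i], vectors[j]))
        for i, j in combinations(range(len(vectors)), 2)
    )
    return max(scored_pairs, key=lambda triple: triple[2])


# Toy vectors standing in for sentence embeddings (hypothetical values).
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.2]]
i, j, score = most_similar_pair(toy_vecs)
print(f"Most similar pair: rows {i} and {j}, score {score:.3f}")
```

With real embeddings, the returned indices would point back into the original list of sentences. For larger collections, a vectorized pairwise call such as sklearn's `cosine_similarity` on the whole array is likely a better fit than this Python loop.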
