3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)

James Briggs · Beginner ·🔍 RAG & Vector Search ·5y ago

Key Takeaways

This video covers three vector-based methods for similarity search: TF-IDF, BM25, and SBERT

Full Transcript

we're going to cover three different vector-based similarity search methods we're going to explain how they work and try and get an intuition for why they work and we're also going to actually go ahead and implement each of these in python now for tf idf and bm25 they're both pretty similar and bm25 is actually a improved version of tfidf and we can classify both of these as being sparse vector methods so these both use sparse vectors which essentially means they're vectors with a lot of zeros in with the occasional value in there and then our bottom one down here which is sentence bert that's an example of a dense vector now dense vectors are quite interesting as they allow us to consider semantics or the meaning behind what we are saying rather than just the syntax and the words that we're using so that that's pretty interesting but all of these can work better than the others in in different situations i've had dense representations of language in some cases work worse than sparse representations and of course the other way around as well so we'll but we'll have a look and we'll cover all of those so let's get started with tf idf now tf idf consists of two different components as you can see the name tf idf now the first of those components is the tf which is the term frequency and term frequency does what it says on the tin it looks at a sentence or a paragraph and given a certain query which is the queue here it will tell you compared to the length of that document how frequent your query is so what we have here is this d which is our document so we're saying okay what is the frequency of the query queue in our document d then on the bottom down here we're saying what is the frequency of all terms in our document d so if we were to calculate this we would get 1 because we only have our one our query in this case is bananas divided by 18 which is the total number of terms in that document and that will give us 0.0 so that's our term frequency which is half of tf idf we still have idf so what what is idf how do we how do we calculate that so idf is the inverse document frequency and it's calculated for each document so each so when we say document here we mean it can be a sentence or a paragraph essentially what you see here we have abc these are on those are documents in this case each one of them is a sentence it could be a paragraph it could be a mix of different things right in this case we have these sentences and what we're saying here is the inverse document frequency is the log of the number of documents so this in this case we would have three over the number of documents that contain the query is or the word is which in this case is all of them so we have all three documents there and so what we get from there is the value one now don't forget that this is a log here so it's not actually one it's if i left myself a little bit space it's log and then and then one log one which is equal to zero so is is a very common word and because it's so common we literally do not care about it and this idf is multiplied by our term frequency so we would have tf down here so if our term that we were searching for was is then all of these sequences would get a value of zero which might seem counterintuitive but if your if your query is something like is is the best city in the forest right you know from the from the top sentence there you wouldn't want to be prioritizing the word is or the or in because those words are everywhere so we don't really care about those ones we want to care about the more unique words and that's what you would get down here so we have a query for forest the number of documents exactly the same still three and how many the word forest so only one of them we have a at the top so we would have log three and what is log three log three is zero 0.48 so then we times that by tf and we get a larger number than we would for is so that is where the idf or inverse document frequency comes in it essentially multiply gives us a multiplier for less common words because chances are they're more relevant to our query now what we can see here is just a work through of what we just looked at so we're calculating the tf for each document and the word is up here we calculate the idf and then we multiply those two together and because is is everywhere the idf is e equals zero and therefore the outcome is always zero because it's just not a relevant word that we care about now on the other hand the word forest when we calculate when we calculate the tf idf for each of these the tf for b and c is zero because the term just doesn't appear in those documents and then the idf is higher because forest is a rarer word so it's 0.48 now the outcome of that is that only this value here so this is the tf idf for the document a is not zero and that that's essentially how tf idf works now let's jump across to python let's have a quick look at how we'd write down code so we have we have those three sentences again up here so let's run that and then just here we can see okay we import numpy we're using numpy here we're merging all the documents here now the reason we do this so dimers into a list of lists is just so we can more easily calculate the terms n and also the term n q equals our query which you can see just here in the idf section now once we do get to tf idf here so we calculate the idf which i just explained anyway and then also the term frequency which is just the number of matches for our specific word in a given sentence so let's run that and let's have a look tf idf for the word let's go over is like we did before in a what do we get we get zero as we as we calculated before and how about forests and here we get the 0.06 obviously it's more precise here but before we got 0.06 in our other screen now that's tf idf reasonably straightforward however i did say that these are vectors right these are sparse vectors obviously this doesn't look much like a vector so how do we turn this into a vector well what we do is we take our vocab so the all of the words that we have in all of our documents so a b and c so if we so we create our vocab and all we want to do is we take a set of our three documents a plus b plus c and this is just going to create a set containing all of the words across all of our documents so so quick look at what we what we get from there vocabulary is just all every single word that we have in there now to create our vector what we want to do is take that vocab and mirror it into a vector so for every word here we're going to calculate the tf idf for each document and saw that in a list which creates our tf idf vector so let me show you how that works so let's first initialize our vector so vector a and what we're going to do is say for word in vocab vector a dot append and then here we're going to call tf idf and then we have our word so we define that as word and our sentence in this case we're doing it for sentence a so to write sentence a like that and we can do this for each of our vectors let's just do a and b for now so b here and here and here okay and let's have a look what we get so we have vector a and we see that we get essentially a vector now we don't have that many documents or that many words here so we're just getting the same number here which is meaning that it appears in one document and that document is is a now let's have a look at b let me see we get a few different values here and that that's how that's our tf idf vector that's how we build it so that's tf idf let's move on to bm25 now bm 25 is a optimized version of tf idf so one of the problems of tf idf is as the frequency of queries found in a document increase the score increased linearly so imagine you have an article and in that 1000 word article it has the word dog maybe ten times now there's a good chance that uh that article was talking about dogs right now if you double the number of words that mention that are dog to 20 the tf idf score for that document will double as well and that if you think about it logically is a document or an article that has the word dog 10 times half as relevant as this same article that has the word dog 20 times probably not maybe it's maybe is slightly less relevant but not quite that much and that's why bm25 comes in it almost normalizes that value and we'll we'll visualize that soon as well but let's explain this because it looks pretty horrific now oh here at the top we have very similar to tfidf we have our essentially term frequency here and then we have our idf here now in our term frequency we can see that we have these two values here and this is the frequency of the query frequency of terms and this is pulled it's straight from tf idf but then we have a few new terms around here so we have k which you can also see here as well now k and b here are adjustable special parameters k would typically be in the range of 1.25 by default and b around 0.75 and we can modify these base on our use case to optimize the algorithm now another new one that we have over here is this d average now this d here that is the current document length so our d length and then over here we have the average document like so that's all of our documents so it's all of the documents their length divided by the number of documents so that would be n now that is the tf component and how it's been modified and then let's have a look at the idf component now the idf component we see these terms here and here which are exactly the same as the two terms that we have here so it's the number of documents versus the number of documents that contain that query and then over here we just have a few constant values so that although bm the bm25 algorithm looks pretty complex it's not that much different from the etf idf algorithm so let's go ahead and implement that in python now over in python we have these sentences or documents here we're going to add the words together in dots just makes it a little bit easier for us and we still have this tf idf function now i want to look at this and compare it to what we would do for bm25 so the first thing is let me just make this a little more readable so sentence account here we turn it into frequency okay so change that to frequency like so now we do the same here we sum in our frequency and then this is our term frequency component of tf idf and bm25 now let's have a look at these so we have k and b those are our adjustable parameters the only sort of new thing we have here really is the average dl which is the average length of all documents so how do we calculate that we just calculate outside function because this is for all documents it doesn't matter which sentence or which word we're looking at it's going to be the same for each one so we just sum the length of each sentence for every sentence in our dots in fact this here we can change that to two dots like so so we change that okay now let's let's just run it quickly see what we get so bm25 purple and in sentence b there is no purple so we would expect zero which is is what we get so we know we know this is at least working right so let's do the same for sentence a and we should get a score so 1.76 which is is what we'd expect so that that looks good now the other bit that is slightly different this is really quite simple is this bit here so the idf component now idf up here what do we do we are just taking the length of the documents dividing by the sum here and that's exactly what we do here so this here we could rewrite if we want to match 2d below as n underscore q like so and that's the only real difference the other part is so here we're using log to the base 10 and down here we're using the natural logarithm so there's a small there's also that difference as well but otherwise we're just adding the 0.5 and the ones and rearranging the n and the n eq and of course this here would be bn now that's bm25 and tf idf but what does that actually look like so when we compare these two algorithms well tf idf it's the score for tf idf increases linearly with the frequency for the number of words or number of matching terms so it goes up like this okay so that that's tf idf now bm 25 is slightly different bm 25 instead of going up like that it does something which looks more like this so it increases a lot quickly and then it kind of levels off so if you have you know four words for bm25 or you have eight like eight will become slightly more relevant the score will show that slightly more relevant but not that much now for tf idf the difference is the the relevance is doubled and you know depending on your use case maybe that is more along the lines of what you want but i do believe in most cases bm25 is probably more realistic so let's say two let's move on to our final one dense vectors with sentence bet now our final algorithm is expert so expert uses something called dense representations of language and what that means is rather than having a vector like we saw with tf idf and which is the same for bm25 where it has a lot of zeros and then the the odd number here and there which is called a sparse vector we have something which has many more values in there so there's there's essentially more information crammed into a smaller space and the effect of this is that you can represent language in a more meaningful manner so the word for example hi would be in a very similar space to the word hello and so would words like days of the week so monday tuesday and so on and the way it sentence works is that we have a transform model we have bert and our words or our query is process is it comes in through here it's processed by many bert encoding layers and then we get this dense vector now as well as our query so q so are our documents they are also processed through through the same encoder network and that produces dense vectors now once we have both of those dense vectors so we have this can be q and then we also have d we use cosine similarity between both of those to calculate how similar they are so how close to each other they are the angle between them and that works pretty much like this so we have over here we have sort of a lone vector and then over here we have two vectors which are much more similar they or at least they share the same direction and these two have a much smaller angle between them than either of those two do with this other vector which is all the way over there and that's essentially how cosine similarity works it just finds the angular angle between two vectors and where the angle is smaller they are more similar where the angle is greater they are less similar now we're not going to implement that from scratch because that would take a long time and be uh pretty hard to do to be honest so what we're going to do is use the sentence transformers library which is a very very good library that uses hugging faces transformers under the hood and has super easy implementations of sentence transformers which is essentially what we just saw but we're using it i was using the example of this specific one called sentence bet which is also what we're going to be using here so we'll run this we're using this bert base nli mean tokens model and we have all of our sentences here so first thing we need to do is we initialize our model here and then after that we encode all of our sentences with model.encode so we do that and we come down here so here we've we've encoded so what this has produced is the sentence embedding so once the text has been processed by albert model it outputs these sentence embeddings and they are vectors that represent the full sentence or the full document that we have input now once we get those out into our sentence embeddings variable there we use the cosine similarity function now here we're just going to use it we're going to use psychic lens implementation and what we're going to do because we have many sentence embeddings here let's let's just have a look so we have sentence embeddings let's have a look at the shape and you see that we have seven sets of embeddings here this here the 768 is the size of the of a single sentence embedding vector now we have seven why do we have seven because we have seven sentences up here so let's what we're going to do here is loop through each one and calculate the cosine similarity between that sentence and all of the other sentences so run that and we will get this array which shows us all of our scores between all of our sentences all possible combinations okay now that's pretty hard to look at so let's visualize it using matplotlib okay and we get this nice visual here so we can see here we have a on the left and it has a very high similarity to the vector a right so i haven't filtered those out or anything we are comparing the same vectors to the same vector and as you would expect they they get a very good score because they are exactly the same they all score one so there's no problem there we just ignore those ones in the middle now let's go on to the next one so we have we have these here so c and b okay those two seem to have very high similarities 0.72 super high so let's have a look so b and c now these are the two that both talk about throwing bananas or actually you know b talks about throwing bananas onto the street c talks about finding bananas on the street so yeah they're pretty similar right but they have a lot of the same words so fair enough tfidf and bm25 both identify those as being similar as well so that's it's good but it doesn't blow you away now how about b here and g so what i've done for g here is i've just taken the sentence b and swap the words around so that we don't really have matching words so throwing bananas instead of throwing i'm using word bombard instead of the word street i'm using road instead of bananas i'm using yellow fruit right so has a similar sort of meaning but they're not using the same word so tf idf and bm25 would would rarely struggle here now let's go down and let's hold up so b and g so we have b here g at the end and yeah they they have a really good score 0.66 says their second highest behind b and c so that just goes to show that sentence bur or sentence transformers in general don't require the same words to use they rely more on the semantic meaning of those words which is i think incredibly cool so that's it for this video we've covered quite a lot the sparse vectors tf idf and bm25 and also dense vector sensors bear or sentence transformers in general so i hope you've enjoyed the video thank you for watching and i'll see you in the next

Original Description

Vector similarity search is one of the fastest-growing domains in AI and machine learning. At its core, it is the process of matching relevant pieces of information together. Similarity search is a complex topic and there are countless techniques for building effective search engines. In this video, we'll cover three vector-based approaches for comparing languages and identifying similar 'documents', covering both vector similarity search and semantic search: - TF-IDF - BM25 - Sentence-BERT 📰 Original article: https://www.pinecone.io/learn/semantic-search/ 🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5 🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership Mining Massive Datasets Book (Similarity Search): 📚 https://amzn.to/3CC0zrc (3rd ed) 📚 https://amzn.to/3AtHSnV (1st ed, cheaper) 👾 Discord https://discord.gg/c5QtDB9RAP 🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff 00:00 Intro 01:37 TF-IDF 11:44 BM25 20:30 SBERT
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from James Briggs · James Briggs · 45 of 60

1 Stoic Philosophy Text Generation with TensorFlow
Stoic Philosophy Text Generation with TensorFlow
James Briggs
2 How to Build TensorFlow Pipelines with tf.data.Dataset
How to Build TensorFlow Pipelines with tf.data.Dataset
James Briggs
3 Every New Feature in Python 3.10.0a2
Every New Feature in Python 3.10.0a2
James Briggs
4 How-to Build a Transformer for Language Classification in TensorFlow
How-to Build a Transformer for Language Classification in TensorFlow
James Briggs
5 How-to use the Kaggle API in Python
How-to use the Kaggle API in Python
James Briggs
6 Language Generation with OpenAI's GPT-2 in Python
Language Generation with OpenAI's GPT-2 in Python
James Briggs
7 Text Summarization with Google AI's T5 in Python
Text Summarization with Google AI's T5 in Python
James Briggs
8 How-to do Sentiment Analysis with Flair in Python
How-to do Sentiment Analysis with Flair in Python
James Briggs
9 Python Environment Setup for Machine Learning
Python Environment Setup for Machine Learning
James Briggs
10 Sequential Model - TensorFlow Essentials #1
Sequential Model - TensorFlow Essentials #1
James Briggs
11 Functional API - TensorFlow Essentials #2
Functional API - TensorFlow Essentials #2
James Briggs
12 Training Parameters - TensorFlow Essentials #3
Training Parameters - TensorFlow Essentials #3
James Briggs
13 Input Data Pipelines - TensorFlow Essentials #4
Input Data Pipelines - TensorFlow Essentials #4
James Briggs
14 6 of Python's Newest and Best Features (3.7-3.9)
6 of Python's Newest and Best Features (3.7-3.9)
James Briggs
15 Novice to Advanced RegEx in Less-than 30 Minutes + Python
Novice to Advanced RegEx in Less-than 30 Minutes + Python
James Briggs
16 Building a PlotLy $GME Chart in Python
Building a PlotLy $GME Chart in Python
James Briggs
17 How-to Use The Reddit API in Python
How-to Use The Reddit API in Python
James Briggs
18 How to Build Custom Q&A Transformer Models in Python
How to Build Custom Q&A Transformer Models in Python
James Briggs
19 How to Build Q&A Models in Python (Transformers)
How to Build Q&A Models in Python (Transformers)
James Briggs
20 How-to Decode Outputs From NLP Models (Python)
How-to Decode Outputs From NLP Models (Python)
James Briggs
21 Identify Stocks on Reddit with SpaCy (NER in Python)
Identify Stocks on Reddit with SpaCy (NER in Python)
James Briggs
22 Sentiment Analysis on ANY Length of Text With Transformers (Python)
Sentiment Analysis on ANY Length of Text With Transformers (Python)
James Briggs
23 Unicode Normalization for NLP in Python
Unicode Normalization for NLP in Python
James Briggs
24 The NEW Match-Case Statement in Python 3.10
The NEW Match-Case Statement in Python 3.10
James Briggs
25 Multi-Class Language Classification With BERT in TensorFlow
Multi-Class Language Classification With BERT in TensorFlow
James Briggs
26 How to Build Python Packages for Pip
How to Build Python Packages for Pip
James Briggs
27 How-to Structure a Q&A ML App
How-to Structure a Q&A ML App
James Briggs
28 How to Index Q&A Data With Haystack and Elasticsearch
How to Index Q&A Data With Haystack and Elasticsearch
James Briggs
29 Q&A Document Retrieval With DPR
Q&A Document Retrieval With DPR
James Briggs
30 How to Use Type Annotations in Python
How to Use Type Annotations in Python
James Briggs
31 Extractive Q&A With Haystack and FastAPI in Python
Extractive Q&A With Haystack and FastAPI in Python
James Briggs
32 Sentence Similarity With Sentence-Transformers in Python
Sentence Similarity With Sentence-Transformers in Python
James Briggs
33 Sentence Similarity With Transformers and PyTorch (Python)
Sentence Similarity With Transformers and PyTorch (Python)
James Briggs
34 NER With Transformers and spaCy (Python)
NER With Transformers and spaCy (Python)
James Briggs
35 Training BERT #1 - Masked-Language Modeling (MLM)
Training BERT #1 - Masked-Language Modeling (MLM)
James Briggs
36 Training BERT #2 - Train With Masked-Language Modeling (MLM)
Training BERT #2 - Train With Masked-Language Modeling (MLM)
James Briggs
37 Training BERT #3 - Next Sentence Prediction (NSP)
Training BERT #3 - Next Sentence Prediction (NSP)
James Briggs
38 Training BERT #4 - Train With Next Sentence Prediction (NSP)
Training BERT #4 - Train With Next Sentence Prediction (NSP)
James Briggs
39 FREE 11 Hour NLP Transformers Course (Next 3 Days Only)
FREE 11 Hour NLP Transformers Course (Next 3 Days Only)
James Briggs
40 New Features in Python 3.10
New Features in Python 3.10
James Briggs
41 Training BERT #5 - Training With BertForPretraining
Training BERT #5 - Training With BertForPretraining
James Briggs
42 How-to Use HuggingFace's Datasets - Transformers From Scratch #1
How-to Use HuggingFace's Datasets - Transformers From Scratch #1
James Briggs
43 Build a Custom Transformer Tokenizer - Transformers From Scratch #2
Build a Custom Transformer Tokenizer - Transformers From Scratch #2
James Briggs
44 3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)
3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)
James Briggs
3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)
3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)
James Briggs
46 Building MLM Training Input Pipeline - Transformers From Scratch #3
Building MLM Training Input Pipeline - Transformers From Scratch #3
James Briggs
47 Training and Testing an Italian BERT - Transformers From Scratch #4
Training and Testing an Italian BERT - Transformers From Scratch #4
James Briggs
48 Faiss - Introduction to Similarity Search
Faiss - Introduction to Similarity Search
James Briggs
49 Angular App Setup With Material - Stoic Q&A #5
Angular App Setup With Material - Stoic Q&A #5
James Briggs
50 Why are there so many Tokenization methods in HF Transformers?
Why are there so many Tokenization methods in HF Transformers?
James Briggs
51 Choosing Indexes for Similarity Search (Faiss in Python)
Choosing Indexes for Similarity Search (Faiss in Python)
James Briggs
52 Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)
Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)
James Briggs
53 How LSH Random Projection works in search (+Python)
How LSH Random Projection works in search (+Python)
James Briggs
54 IndexLSH for Fast Similarity Search in Faiss
IndexLSH for Fast Similarity Search in Faiss
James Briggs
55 Faiss - Vector Compression with PQ and IVFPQ (in Python)
Faiss - Vector Compression with PQ and IVFPQ (in Python)
James Briggs
56 Product Quantization for Vector Similarity Search (+ Python)
Product Quantization for Vector Similarity Search (+ Python)
James Briggs
57 How to Build a Bert WordPiece Tokenizer in Python and HuggingFace
How to Build a Bert WordPiece Tokenizer in Python and HuggingFace
James Briggs
58 Metadata Filtering for Vector Search + Latest Filter Tech
Metadata Filtering for Vector Search + Latest Filter Tech
James Briggs
59 Build NLP Pipelines with HuggingFace Datasets
Build NLP Pipelines with HuggingFace Datasets
James Briggs
60 Composite Indexes and the Faiss Index Factory
Composite Indexes and the Faiss Index Factory
James Briggs

Related AI Lessons

Why you shouldn’t search your documents directly with AI
Learn why directly searching documents with AI can be inefficient and how retrieval-augmented systems can improve the process
Medium · Programming
Your AI Keeps Making Things Up. RAG Is How You Make It Use Real Facts Instead.
Learn how to use RAG to make your AI provide accurate answers based on real facts instead of making things up
Medium · RAG
Evaluation Metrics for RAG: Measure Retrieval, Generation, and End-to-End Quality With Numbers That…
Learn to evaluate RAG models using metrics that measure retrieval, generation, and end-to-end quality
Medium · AI
Evaluation Metrics for RAG: Measure Retrieval, Generation, and End-to-End Quality With Numbers That…
Learn to evaluate RAG models using metrics that measure retrieval, generation, and end-to-end quality
Medium · Data Science

Chapters (4)

Intro
1:37 TF-IDF
11:44 BM25
20:30 SBERT
Up next
RRF vs DBSF with Qdrant: Hybrid Retrieval Fusion for RAG in Python
Professor Py: AI Engineering
Watch →