3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)

James Briggs · Beginner ·🔍 RAG & Vector Search ·5y ago

Key Takeaways

This video covers three vector-based methods for similarity search: TF-IDF, BM25, and SBERT

Full Transcript

we're going to cover three different vector-based similarity search methods we're going to explain how they work and try and get an intuition for why they work and we're also going to actually go ahead and implement each of these in python now for tf idf and bm25 they're both pretty similar and bm25 is actually a improved version of tfidf and we can classify both of these as being sparse vector methods so these both use sparse vectors which essentially means they're vectors with a lot of zeros in with the occasional value in there and then our bottom one down here which is sentence bert that's an example of a dense vector now dense vectors are quite interesting as they allow us to consider semantics or the meaning behind what we are saying rather than just the syntax and the words that we're using so that that's pretty interesting but all of these can work better than the others in in different situations i've had dense representations of language in some cases work worse than sparse representations and of course the other way around as well so we'll but we'll have a look and we'll cover all of those so let's get started with tf idf now tf idf consists of two different components as you can see the name tf idf now the first of those components is the tf which is the term frequency and term frequency does what it says on the tin it looks at a sentence or a paragraph and given a certain query which is the queue here it will tell you compared to the length of that document how frequent your query is so what we have here is this d which is our document so we're saying okay what is the frequency of the query queue in our document d then on the bottom down here we're saying what is the frequency of all terms in our document d so if we were to calculate this we would get 1 because we only have our one our query in this case is bananas divided by 18 which is the total number of terms in that document and that will give us 0.0 so that's our term frequency which is half of tf idf we still have idf so what what is idf how do we how do we calculate that so idf is the inverse document frequency and it's calculated for each document so each so when we say document here we mean it can be a sentence or a paragraph essentially what you see here we have abc these are on those are documents in this case each one of them is a sentence it could be a paragraph it could be a mix of different things right in this case we have these sentences and what we're saying here is the inverse document frequency is the log of the number of documents so this in this case we would have three over the number of documents that contain the query is or the word is which in this case is all of them so we have all three documents there and so what we get from there is the value one now don't forget that this is a log here so it's not actually one it's if i left myself a little bit space it's log and then and then one log one which is equal to zero so is is a very common word and because it's so common we literally do not care about it and this idf is multiplied by our term frequency so we would have tf down here so if our term that we were searching for was is then all of these sequences would get a value of zero which might seem counterintuitive but if your if your query is something like is is the best city in the forest right you know from the from the top sentence there you wouldn't want to be prioritizing the word is or the or in because those words are everywhere so we don't really care about those ones we want to care about the more unique words and that's what you would get down here so we have a query for forest the number of documents exactly the same still three and how many the word forest so only one of them we have a at the top so we would have log three and what is log three log three is zero 0.48 so then we times that by tf and we get a larger number than we would for is so that is where the idf or inverse document frequency comes in it essentially multiply gives us a multiplier for less common words because chances are they're more relevant to our query now what we can see here is just a work through of what we just looked at so we're calculating the tf for each document and the word is up here we calculate the idf and then we multiply those two together and because is is everywhere the idf is e equals zero and therefore the outcome is always zero because it's just not a relevant word that we care about now on the other hand the word forest when we calculate when we calculate the tf idf for each of these the tf for b and c is zero because the term just doesn't appear in those documents and then the idf is higher because forest is a rarer word so it's 0.48 now the outcome of that is that only this value here so this is the tf idf for the document a is not zero and that that's essentially how tf idf works now let's jump across to python let's have a quick look at how we'd write down code so we have we have those three sentences again up here so let's run that and then just here we can see okay we import numpy we're using numpy here we're merging all the documents here now the reason we do this so dimers into a list of lists is just so we can more easily calculate the terms n and also the term n q equals our query which you can see just here in the idf section now once we do get to tf idf here so we calculate the idf which i just explained anyway and then also the term frequency which is just the number of matches for our specific word in a given sentence so let's run that and let's have a look tf idf for the word let's go over is like we did before in a what do we get we get zero as we as we calculated before and how about forests and here we get the 0.06 obviously it's more precise here but before we got 0.06 in our other screen now that's tf idf reasonably straightforward however i did say that these are vectors right these are sparse vectors obviously this doesn't look much like a vector so how do we turn this into a vector well what we do is we take our vocab so the all of the words that we have in all of our documents so a b and c so if we so we create our vocab and all we want to do is we take a set of our three documents a plus b plus c and this is just going to create a set containing all of the words across all of our documents so so quick look at what we what we get from there vocabulary is just all every single word that we have in there now to create our vector what we want to do is take that vocab and mirror it into a vector so for every word here we're going to calculate the tf idf for each document and saw that in a list which creates our tf idf vector so let me show you how that works so let's first initialize our vector so vector a and what we're going to do is say for word in vocab vector a dot append and then here we're going to call tf idf and then we have our word so we define that as word and our sentence in this case we're doing it for sentence a so to write sentence a like that and we can do this for each of our vectors let's just do a and b for now so b here and here and here okay and let's have a look what we get so we have vector a and we see that we get essentially a vector now we don't have that many documents or that many words here so we're just getting the same number here which is meaning that it appears in one document and that document is is a now let's have a look at b let me see we get a few different values here and that that's how that's our tf idf vector that's how we build it so that's tf idf let's move on to bm25 now bm 25 is a optimized version of tf idf so one of the problems of tf idf is as the frequency of queries found in a document increase the score increased linearly so imagine you have an article and in that 1000 word article it has the word dog maybe ten times now there's a good chance that uh that article was talking about dogs right now if you double the number of words that mention that are dog to 20 the tf idf score for that document will double as well and that if you think about it logically is a document or an article that has the word dog 10 times half as relevant as this same article that has the word dog 20 times probably not maybe it's maybe is slightly less relevant but not quite that much and that's why bm25 comes in it almost normalizes that value and we'll we'll visualize that soon as well but let's explain this because it looks pretty horrific now oh here at the top we have very similar to tfidf we have our essentially term frequency here and then we have our idf here now in our term frequency we can see that we have these two values here and this is the frequency of the query frequency of terms and this is pulled it's straight from tf idf but then we have a few new terms around here so we have k which you can also see here as well now k and b here are adjustable special parameters k would typically be in the range of 1.25 by default and b around 0.75 and we can modify these base on our use case to optimize the algorithm now another new one that we have over here is this d average now this d here that is the current document length so our d length and then over here we have the average document like so that's all of our documents so it's all of the documents their length divided by the number of documents so that would be n now that is the tf component and how it's been modified and then let's have a look at the idf component now the idf component we see these terms here and here which are exactly the same as the two terms that we have here so it's the number of documents versus the number of documents that contain that query and then over here we just have a few constant values so that although bm the bm25 algorithm looks pretty complex it's not that much different from the etf idf algorithm so let's go ahead and implement that in python now over in python we have these sentences or documents here we're going to add the words together in dots just makes it a little bit easier for us and we still have this tf idf function now i want to look at this and compare it to what we would do for bm25 so the first thing is let me just make this a little more readable so sentence account here we turn it into frequency okay so change that to frequency like so now we do the same here we sum in our frequency and then this is our term frequency component of tf idf and bm25 now let's have a look at these so we have k and b those are our adjustable parameters the only sort of new thing we have here really is the average dl which is the average length of all documents so how do we calculate that we just calculate outside function because this is for all documents it doesn't matter which sentence or which word we're looking at it's going to be the same for each one so we just sum the length of each sentence for every sentence in our dots in fact this here we can change that to two dots like so so we change that okay now let's let's just run it quickly see what we get so bm25 purple and in sentence b there is no purple so we would expect zero which is is what we get so we know we know this is at least working right so let's do the same for sentence a and we should get a score so 1.76 which is is what we'd expect so that that looks good now the other bit that is slightly different this is really quite simple is this bit here so the idf component now idf up here what do we do we are just taking the length of the documents dividing by the sum here and that's exactly what we do here so this here we could rewrite if we want to match 2d below as n underscore q like so and that's the only real difference the other part is so here we're using log to the base 10 and down here we're using the natural logarithm so there's a small there's also that difference as well but otherwise we're just adding the 0.5 and the ones and rearranging the n and the n eq and of course this here would be bn now that's bm25 and tf idf but what does that actually look like so when we compare these two algorithms well tf idf it's the score for tf idf increases linearly with the frequency for the number of words or number of matching terms so it goes up like this okay so that that's tf idf now bm 25 is slightly different bm 25 instead of going up like that it does something which looks more like this so it increases a lot quickly and then it kind of levels off so if you have you know four words for bm25 or you have eight like eight will become slightly more relevant the score will show that slightly more relevant but not that much now for tf idf the difference is the the relevance is doubled and you know depending on your use case maybe that is more along the lines of what you want but i do believe in most cases bm25 is probably more realistic so let's say two let's move on to our final one dense vectors with sentence bet now our final algorithm is expert so expert uses something called dense representations of language and what that means is rather than having a vector like we saw with tf idf and which is the same for bm25 where it has a lot of zeros and then the the odd number here and there which is called a sparse vector we have something which has many more values in there so there's there's essentially more information crammed into a smaller space and the effect of this is that you can represent language in a more meaningful manner so the word for example hi would be in a very similar space to the word hello and so would words like days of the week so monday tuesday and so on and the way it sentence works is that we have a transform model we have bert and our words or our query is process is it comes in through here it's processed by many bert encoding layers and then we get this dense vector now as well as our query so q so are our documents they are also processed through through the same encoder network and that produces dense vectors now once we have both of those dense vectors so we have this can be q and then we also have d we use cosine similarity between both of those to calculate how similar they are so how close to each other they are the angle between them and that works pretty much like this so we have over here we have sort of a lone vector and then over here we have two vectors which are much more similar they or at least they share the same direction and these two have a much smaller angle between them than either of those two do with this other vector which is all the way over there and that's essentially how cosine similarity works it just finds the angular angle between two vectors and where the angle is smaller they are more similar where the angle is greater they are less similar now we're not going to implement that from scratch because that would take a long time and be uh pretty hard to do to be honest so what we're going to do is use the sentence transformers library which is a very very good library that uses hugging faces transformers under the hood and has super easy implementations of sentence transformers which is essentially what we just saw but we're using it i was using the example of this specific one called sentence bet which is also what we're going to be using here so we'll run this we're using this bert base nli mean tokens model and we have all of our sentences here so first thing we need to do is we initialize our model here and then after that we encode all of our sentences with model.encode so we do that and we come down here so here we've we've encoded so what this has produced is the sentence embedding so once the text has been processed by albert model it outputs these sentence embeddings and they are vectors that represent the full sentence or the full document that we have input now once we get those out into our sentence embeddings variable there we use the cosine similarity function now here we're just going to use it we're going to use psychic lens implementation and what we're going to do because we have many sentence embeddings here let's let's just have a look so we have sentence embeddings let's have a look at the shape and you see that we have seven sets of embeddings here this here the 768 is the size of the of a single sentence embedding vector now we have seven why do we have seven because we have seven sentences up here so let's what we're going to do here is loop through each one and calculate the cosine similarity between that sentence and all of the other sentences so run that and we will get this array which shows us all of our scores between all of our sentences all possible combinations okay now that's pretty hard to look at so let's visualize it using matplotlib okay and we get this nice visual here so we can see here we have a on the left and it has a very high similarity to the vector a right so i haven't filtered those out or anything we are comparing the same vectors to the same vector and as you would expect they they get a very good score because they are exactly the same they all score one so there's no problem there we just ignore those ones in the middle now let's go on to the next one so we have we have these here so c and b okay those two seem to have very high similarities 0.72 super high so let's have a look so b and c now these are the two that both talk about throwing bananas or actually you know b talks about throwing bananas onto the street c talks about finding bananas on the street so yeah they're pretty similar right but they have a lot of the same words so fair enough tfidf and bm25 both identify those as being similar as well so that's it's good but it doesn't blow you away now how about b here and g so what i've done for g here is i've just taken the sentence b and swap the words around so that we don't really have matching words so throwing bananas instead of throwing i'm using word bombard instead of the word street i'm using road instead of bananas i'm using yellow fruit right so has a similar sort of meaning but they're not using the same word so tf idf and bm25 would would rarely struggle here now let's go down and let's hold up so b and g so we have b here g at the end and yeah they they have a really good score 0.66 says their second highest behind b and c so that just goes to show that sentence bur or sentence transformers in general don't require the same words to use they rely more on the semantic meaning of those words which is i think incredibly cool so that's it for this video we've covered quite a lot the sparse vectors tf idf and bm25 and also dense vector sensors bear or sentence transformers in general so i hope you've enjoyed the video thank you for watching and i'll see you in the next

Original Description

Vector similarity search is one of the fastest-growing domains in AI and machine learning. At its core, it is the process of matching relevant pieces of information together. Similarity search is a complex topic and there are countless techniques for building effective search engines. In this video, we'll cover three vector-based approaches for comparing languages and identifying similar 'documents', covering both vector similarity search and semantic search: - TF-IDF - BM25 - Sentence-BERT 📰 Original article: https://www.pinecone.io/learn/semantic-search/ 🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5 🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership Mining Massive Datasets Book (Similarity Search): 📚 https://amzn.to/3CC0zrc (3rd ed) 📚 https://amzn.to/3AtHSnV (1st ed, cheaper) 👾 Discord https://discord.gg/c5QtDB9RAP 🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff 00:00 Intro 01:37 TF-IDF 11:44 BM25 20:30 SBERT

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from James Briggs · James Briggs · 45 of 60

← Previous Next →

Stoic Philosophy Text Generation with TensorFlow

Stoic Philosophy Text Generation with TensorFlow

How to Build TensorFlow Pipelines with tf.data.Dataset

How to Build TensorFlow Pipelines with tf.data.Dataset

Every New Feature in Python 3.10.0a2

Every New Feature in Python 3.10.0a2

How-to Build a Transformer for Language Classification in TensorFlow

How-to Build a Transformer for Language Classification in TensorFlow

How-to use the Kaggle API in Python

How-to use the Kaggle API in Python

Language Generation with OpenAI's GPT-2 in Python

Language Generation with OpenAI's GPT-2 in Python

Text Summarization with Google AI's T5 in Python

Text Summarization with Google AI's T5 in Python

How-to do Sentiment Analysis with Flair in Python

How-to do Sentiment Analysis with Flair in Python

Python Environment Setup for Machine Learning

Python Environment Setup for Machine Learning

Sequential Model - TensorFlow Essentials #1

Sequential Model - TensorFlow Essentials #1

Functional API - TensorFlow Essentials #2

Functional API - TensorFlow Essentials #2

Training Parameters - TensorFlow Essentials #3

Training Parameters - TensorFlow Essentials #3

Input Data Pipelines - TensorFlow Essentials #4

Input Data Pipelines - TensorFlow Essentials #4

6 of Python's Newest and Best Features (3.7-3.9)

6 of Python's Newest and Best Features (3.7-3.9)

Novice to Advanced RegEx in Less-than 30 Minutes + Python

Novice to Advanced RegEx in Less-than 30 Minutes + Python

Building a PlotLy $GME Chart in Python

Building a PlotLy $GME Chart in Python

How-to Use The Reddit API in Python

How-to Use The Reddit API in Python

How to Build Custom Q&A Transformer Models in Python

How to Build Custom Q&A Transformer Models in Python

How to Build Q&A Models in Python (Transformers)

How to Build Q&A Models in Python (Transformers)

How-to Decode Outputs From NLP Models (Python)

How-to Decode Outputs From NLP Models (Python)

Identify Stocks on Reddit with SpaCy (NER in Python)

Identify Stocks on Reddit with SpaCy (NER in Python)

Sentiment Analysis on ANY Length of Text With Transformers (Python)

Sentiment Analysis on ANY Length of Text With Transformers (Python)

Unicode Normalization for NLP in Python

Unicode Normalization for NLP in Python

The NEW Match-Case Statement in Python 3.10

The NEW Match-Case Statement in Python 3.10

Multi-Class Language Classification With BERT in TensorFlow

Multi-Class Language Classification With BERT in TensorFlow

How to Build Python Packages for Pip

How to Build Python Packages for Pip

How-to Structure a Q&A ML App

How-to Structure a Q&A ML App

How to Index Q&A Data With Haystack and Elasticsearch

How to Index Q&A Data With Haystack and Elasticsearch

Q&A Document Retrieval With DPR

Q&A Document Retrieval With DPR

How to Use Type Annotations in Python

How to Use Type Annotations in Python

Extractive Q&A With Haystack and FastAPI in Python

Extractive Q&A With Haystack and FastAPI in Python

Sentence Similarity With Sentence-Transformers in Python

Sentence Similarity With Sentence-Transformers in Python

Sentence Similarity With Transformers and PyTorch (Python)

Sentence Similarity With Transformers and PyTorch (Python)

NER With Transformers and spaCy (Python)

NER With Transformers and spaCy (Python)

Training BERT #1 - Masked-Language Modeling (MLM)

Training BERT #1 - Masked-Language Modeling (MLM)

Training BERT #2 - Train With Masked-Language Modeling (MLM)

Training BERT #2 - Train With Masked-Language Modeling (MLM)

Training BERT #3 - Next Sentence Prediction (NSP)

Training BERT #3 - Next Sentence Prediction (NSP)

Training BERT #4 - Train With Next Sentence Prediction (NSP)

Training BERT #4 - Train With Next Sentence Prediction (NSP)

FREE 11 Hour NLP Transformers Course (Next 3 Days Only)

FREE 11 Hour NLP Transformers Course (Next 3 Days Only)

New Features in Python 3.10

New Features in Python 3.10

Training BERT #5 - Training With BertForPretraining

Training BERT #5 - Training With BertForPretraining

How-to Use HuggingFace's Datasets - Transformers From Scratch #1

How-to Use HuggingFace's Datasets - Transformers From Scratch #1

Build a Custom Transformer Tokenizer - Transformers From Scratch #2

Build a Custom Transformer Tokenizer - Transformers From Scratch #2

3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)

3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)

3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)

3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)

Building MLM Training Input Pipeline - Transformers From Scratch #3

Building MLM Training Input Pipeline - Transformers From Scratch #3

Training and Testing an Italian BERT - Transformers From Scratch #4

Training and Testing an Italian BERT - Transformers From Scratch #4

Faiss - Introduction to Similarity Search

Faiss - Introduction to Similarity Search

Angular App Setup With Material - Stoic Q&A #5

Angular App Setup With Material - Stoic Q&A #5

Why are there so many Tokenization methods in HF Transformers?

Why are there so many Tokenization methods in HF Transformers?

Choosing Indexes for Similarity Search (Faiss in Python)

Choosing Indexes for Similarity Search (Faiss in Python)

Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)

Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)

How LSH Random Projection works in search (+Python)

How LSH Random Projection works in search (+Python)

IndexLSH for Fast Similarity Search in Faiss

IndexLSH for Fast Similarity Search in Faiss

Faiss - Vector Compression with PQ and IVFPQ (in Python)

Faiss - Vector Compression with PQ and IVFPQ (in Python)

Product Quantization for Vector Similarity Search (+ Python)

Product Quantization for Vector Similarity Search (+ Python)

How to Build a Bert WordPiece Tokenizer in Python and HuggingFace

How to Build a Bert WordPiece Tokenizer in Python and HuggingFace

Metadata Filtering for Vector Search + Latest Filter Tech

Metadata Filtering for Vector Search + Latest Filter Tech

Build NLP Pipelines with HuggingFace Datasets

Build NLP Pipelines with HuggingFace Datasets

Composite Indexes and the Faiss Index Factory

Composite Indexes and the Faiss Index Factory

Related AI Lessons

Why you shouldn’t search your documents directly with AI

Learn why directly searching documents with AI can be inefficient and how retrieval-augmented systems can improve the process

Medium · Programming

Your AI Keeps Making Things Up. RAG Is How You Make It Use Real Facts Instead.

Learn how to use RAG to make your AI provide accurate answers based on real facts instead of making things up

Evaluation Metrics for RAG: Measure Retrieval, Generation, and End-to-End Quality With Numbers That…

Learn to evaluate RAG models using metrics that measure retrieval, generation, and end-to-end quality

Evaluation Metrics for RAG: Measure Retrieval, Generation, and End-to-End Quality With Numbers That…

Learn to evaluate RAG models using metrics that measure retrieval, generation, and end-to-end quality

Medium · Data Science

Chapters (4)

Intro

1:37 TF-IDF

11:44 BM25

20:30 SBERT

RRF vs DBSF with Qdrant: Hybrid Retrieval Fusion for RAG in Python

Professor Py: AI Engineering