IndexLSH for Fast Similarity Search in Faiss

James Briggs · Beginner ·🔍 RAG & Vector Search ·4y ago

Skills: Vector Stores90%RAG Evaluation80%Advanced RAG70%

Key Takeaways

The video demonstrates the use of IndexLSH in Faiss for fast similarity search, showcasing its implementation, configuration, and performance evaluation. It covers the initialization of IndexLSH, searching for k nearest neighbors, and visualizing the distribution of vectors in each bucket.

Full Transcript

hi and welcome to the video we're going to be covering how we can implement lsh in physe now we covered lsh and how it works using random projection in a previous video which i'm going to release at the same time as this video so i'll link to that in the description if you want to understand how lsh works but in this video we're just going to cover how we implement it in vice so we're going to be using the sift 1m data set which you can download using i'll put a link to the script in the description but it's just a big data set of dense vectors now from this we'll get our 1 million samples xp and we also have our queries which i'm going to just we're just going to use one query at a time okay so we'll have one query one million samples and we perform our search using that so what we first want to do is import files i'm going to get my dimensionality which is is it xb dot shape and 1 i think d yep 128 so that's our dimensionality of each vector within the sift 1m data set and initially let's use an n bits value of 4 which gives us if we we think about it how many how many buckets does that give us we can do this right so it gives us a total of 16 buckets okay these are also potential buckets our vectors aren't necessarily going to be separated between all of them they might we might end up only using like 10 of those buckets for example okay and then what we want to do is initialize our index so first index lsh and in here you need to pass our dimensionality and also our n bits value so how many binary values we're going to include within each bucketed item so do we first thing it's always good to check do we need to train the index indexes train is true so that means we don't need to train it there's no optimizations or anything we need to do here so that's good don't need to worry about that and then so without needing training all we do is index add and we add our data okay and now we're actually just ready to go ahead and search so we'll do d i equals index.search xq and let's return let's set k up here actually so let's set k equal to 10 okay so do k [Music] and then we get the the these are the apparently the most similar indexes to our query okay and there's one good thing to check here which makes it really easy to figure out if we have enough buckets or not is d which is the distance between those vectors now obviously if we have all of our vectors in a single bucket and we we search and our query ends up in the same bucket all of the distances here are going to be zero because they're all going to be exactly the same uh hamming distance which is what we're returning so let's look and that's that's what we get since of i mean we have very few uh numbers here so that's not that surprising let's push k up to 10k you'd think at this point you know surely would get someone's i don't know so at least the first 10 000 nearest neighbors are all within one bucket and so it clearly and there's an issue here we can't differentiate between all of those but if we if we just increase this to a very high number let's say like 500 000 so this is returning half of the index at this point we should i imagine see we see a few ones so at some point within the first 5 000 items we do actually um we we move from a single bucket to at least one other bucket now if we think about it we can we can actually estimate roughly how what n bits value we need in order to kind of get a better distribution uh between our buckets so to not have everything in in a single bucket so let's try these we have uh for n bits in in these values let's give it a go see what we get um and we see okay embassy equals two so this gives them gives us an average of 250 000 um items within each bucket so obviously that's not useful uh here we get 62k again not used to 4k 15 okay that's a lot better right but it's still not useful because we kind of want you know if we if we set k equal to 10 um this isn't really going to give us anything particularly useful so and and we also need to consider the fact that we in our buckets not all of them are going to get used and what we will find is that it's almost like there's there will be a majority of or there will be a minority of buckets that end up with the majority of vectors so we we need to go sub one uh for this to to work so i i would say something probably n bits of 24 we're getting that i mean that value actually looks pretty good 0.05 so on average we should get 0.05 items within each bucket okay so i mean let's let's test that so we're going to copy this and we'll just write a similar code to what we wrote before although one thing before we do that let's just check uh the average similarity let's say should we change k to like 100 or 10 let's let's go with 10 yeah let's do 10. and what i want to do is given that i want to calculate the the cosine similarity average cosine similarity so we'll do that we use from sk learn metrics pairwise import cosine similarity this is just super quick way for us to calculate and then check um s k learn and we'll do cosine similarity and we have i zero xq sure oh so sorry we need to uh so we need to get there the indexes so i i just contains the indexes we need to we need to pull them out of xb okay and we have this so we also need to take the mean of that so cos mean okay 0.4 which is just the average of those the average similarity of those so it's pretty low let's compare that to if we did that for the whole um the whole data set so like this see what we get okay so actually these are yeah i mean these are slightly more similar um so we even you know get slightly better results even even with that terrible sort of bucketing and so what i wanted to do is is repeat that process we just did there but i want to do it for different end bits values so we need to take our index bring it down here and we're going to say this we do index.add our data and then we do d i equals index search as we as we usually do uh k is 10 and then we want to do cos equals cosine similarity again we're going to do x b i so we're pulling out those indexes that we've got um we're doing this different n bits values hence the hence loop and hq okay and then what we're going to do is just print n bits is equal to whatever it's equal to at that point and then new line i'm going to put um what is similarity so the similarity is equal to cos dot mean like that let's just see what we what we get so okay cool so so we look here and so n bits of two is is terrible gets better it gets better and then at 24 so that bit where we have the 0.05 we get the maximum value so we get a 7 3 here 7 4 which is the best we get and then it kind of comes off a little bit here and i've kind of visualized that here so this is the same um so sure i've gone a little bit high with with n bits as well and this is essentially the sort of structure we get so the bit that we just kind of saw was this section here where we're increasing n bits and reducing the number of or reducing like heavily reducing the number of vectors within each bucket on average and the coastline similarity just shoots up really quick but then afterwards we still get this kind of slow increase and that's because so the buckets are now quite spread like the the vectors are spread very nicely between different buckets but as we increase the resolution even further and further all it means is that we're adding more of the original information the original positional information of those vectors and essentially improving the resolution of those vectors so they're just getting slightly more accurate in the whether the hash buckets are closer to other very similar or not similar vectors now essentially the binary vectors are more heavily representing uh the original dense vectors and that's why we see that slight increase that continues even after we get to to ambits of 32. now one final thing i just wanted to quickly cover um is say we do have issues and and we want to sort of visualize the actual buckets themselves you know how how can we how can we extract those buckets and and view and view the number of vectors that we have in each one now there isn't any any way of doing this in in um in face so you we kind of have to put our own code together so what we need to do is i'm going to create first create another index because i want to visualize these so i'm going to do and if i visualize it with 10 bits of 32 there's going to be well how many how many buckets would we have there it would be 2 to the power of 32 so yeah we can't visualize that many buckets so we're going to go with 4 and then bits of 4 which gives us 16 buckets so index lsh we have d and we're going to use a n bits value of four n bits equals four just put that there okay we're going to add our vectors and now what we want to do is actually extract those binary vectors themselves so to do that we write array so we're going to sort them in in the array variable and we do fice dot vector to array like this and in here we do index dot codes so codes um just that word there is essentially that means you know the binary vectors that we've built they are called codes that's all it means but and we use it this word a lot in in similarity search so it's worth remembering also n bits is used a lot as well so in future videos we will uh in this series we you will see them uh quite a lot so then let's have a look at what we have we have okay so we wanted to pull out the binary codes and we're getting these uh these numbers which is not exactly what we expect or or is it right um well okay let's have a look at the min value here zero okay array max okay so we get 0 to 15 which means we get 16 values and okay let's do n to the power of n bits okay so that means we have 16 unique buckets that all of our vectors are spread between so that's interesting let's see how many vectors we have we have one million so we have one million spread between 16 buckets so actually that sounds about right but how do we so these these values here these are just the integer versions of our binary vectors so um well like this is represented by zero and [Music] this is represented by one right so what we can do is use this little little script here to convert them into the actual binary vectors so here this this value is our first one which is a five and the next one is a 12 we should see here so i mean that's that is our there are binary vectors and we can we can visualize how they are are spread um how many items are within each one of those buckets using this array and what we'll get is something that looks like this um so this is what i mean where we have so we have 16 buckets which you can see on the left here so all of these with 16 of those and not all of them even have any any vectors in so we have so we have one two three four that don't even contain any vectors and then the majority the vast majority are contained within uh these and these right so so these four buckets contain almost everything and they have loads they have uh so the count is to the thousands there so here that number add a few more zeros on to the end um so yeah the these four buckets are far too many far too many vectors in for it to be useful so that's why we before we increase it uh increase our in bits value up to 24. now i think i mean that's basically everything i wanted to cover but just be quickly in terms of performance um so visualize these using this is using device l h s l s h index um so we have n bits so obviously we just increase n bits loads uh we increase the recoil performance so that's one thing about lsh is that to get good performance you really have to increase the end bits value a lot and at that point it actually gets it can get really inefficient um so this is a the time for each query as compared to a flight index so obviously i mean you you kind of want to be careful with a lsh index obviously it can be useful um but you know high dimensionality is difficult to balance well with with speed and accuracy but obviously you know it can be done it just depends on how much accuracy you're willing to sacrifice because in most cases with lsh you will be sacrificing a reasonable amount of accuracy so the recoil here is about 50 and so that's uh it's it's managing to to identify 50 around here of the same items that a flat index would would identify and it's only slightly faster so yeah you just have to weigh you know what is most important for your use case with that but anyway that's it for lsh um this book this is the last video we will cover for lsh for now in the next few videos and articles we're going to be covering a few different indexes and and how we use them effectively so it should be pretty interesting but we're definitely getting a bit more advanced than lsh now so it should be good but that's it for this video thank you very much for watching and i'll see you in the next one

Original Description

Faiss - or Facebook AI Similarity Search - is an open-source framework built for enabling similarity search. Faiss has many super-efficient implementations of different indexes that we can use in similarity search. That long list of indexes includes IndexLSH - an easy-to-use implementation of everything we have covered so far in LSH. 🌲 Pinecone article: https://www.pinecone.io/learn/locality-sensitive-hashing-random-projection/ Download Sift1M: https://gist.github.com/jamescalam/a09a16c17b677f2cf9c019114711f3bf How LSH Random Projection works in search (+Python): https://youtu.be/8bOrMqEdfiQ 🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5 🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership 👾 Discord: https://discord.gg/c5QtDB9RAP 🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from James Briggs · James Briggs · 54 of 60

← Previous Next →

Stoic Philosophy Text Generation with TensorFlow

Stoic Philosophy Text Generation with TensorFlow

How to Build TensorFlow Pipelines with tf.data.Dataset

How to Build TensorFlow Pipelines with tf.data.Dataset

Every New Feature in Python 3.10.0a2

Every New Feature in Python 3.10.0a2

How-to Build a Transformer for Language Classification in TensorFlow

How-to Build a Transformer for Language Classification in TensorFlow

How-to use the Kaggle API in Python

How-to use the Kaggle API in Python

Language Generation with OpenAI's GPT-2 in Python

Language Generation with OpenAI's GPT-2 in Python

Text Summarization with Google AI's T5 in Python

Text Summarization with Google AI's T5 in Python

How-to do Sentiment Analysis with Flair in Python

How-to do Sentiment Analysis with Flair in Python

Python Environment Setup for Machine Learning

Python Environment Setup for Machine Learning

Sequential Model - TensorFlow Essentials #1

Sequential Model - TensorFlow Essentials #1

Functional API - TensorFlow Essentials #2

Functional API - TensorFlow Essentials #2

Training Parameters - TensorFlow Essentials #3

Training Parameters - TensorFlow Essentials #3

Input Data Pipelines - TensorFlow Essentials #4

Input Data Pipelines - TensorFlow Essentials #4

6 of Python's Newest and Best Features (3.7-3.9)

6 of Python's Newest and Best Features (3.7-3.9)

Novice to Advanced RegEx in Less-than 30 Minutes + Python

Novice to Advanced RegEx in Less-than 30 Minutes + Python

Building a PlotLy $GME Chart in Python

Building a PlotLy $GME Chart in Python

How-to Use The Reddit API in Python

How-to Use The Reddit API in Python

How to Build Custom Q&A Transformer Models in Python

How to Build Custom Q&A Transformer Models in Python

How to Build Q&A Models in Python (Transformers)

How to Build Q&A Models in Python (Transformers)

How-to Decode Outputs From NLP Models (Python)

How-to Decode Outputs From NLP Models (Python)

Identify Stocks on Reddit with SpaCy (NER in Python)

Identify Stocks on Reddit with SpaCy (NER in Python)

Sentiment Analysis on ANY Length of Text With Transformers (Python)

Sentiment Analysis on ANY Length of Text With Transformers (Python)

Unicode Normalization for NLP in Python

Unicode Normalization for NLP in Python

The NEW Match-Case Statement in Python 3.10

The NEW Match-Case Statement in Python 3.10

Multi-Class Language Classification With BERT in TensorFlow

Multi-Class Language Classification With BERT in TensorFlow

How to Build Python Packages for Pip

How to Build Python Packages for Pip

How-to Structure a Q&A ML App

How-to Structure a Q&A ML App

How to Index Q&A Data With Haystack and Elasticsearch

How to Index Q&A Data With Haystack and Elasticsearch

Q&A Document Retrieval With DPR

Q&A Document Retrieval With DPR

How to Use Type Annotations in Python

How to Use Type Annotations in Python

Extractive Q&A With Haystack and FastAPI in Python

Extractive Q&A With Haystack and FastAPI in Python

Sentence Similarity With Sentence-Transformers in Python

Sentence Similarity With Sentence-Transformers in Python

Sentence Similarity With Transformers and PyTorch (Python)

Sentence Similarity With Transformers and PyTorch (Python)

NER With Transformers and spaCy (Python)

NER With Transformers and spaCy (Python)

Training BERT #1 - Masked-Language Modeling (MLM)

Training BERT #1 - Masked-Language Modeling (MLM)

Training BERT #2 - Train With Masked-Language Modeling (MLM)

Training BERT #2 - Train With Masked-Language Modeling (MLM)

Training BERT #3 - Next Sentence Prediction (NSP)

Training BERT #3 - Next Sentence Prediction (NSP)

Training BERT #4 - Train With Next Sentence Prediction (NSP)

Training BERT #4 - Train With Next Sentence Prediction (NSP)

FREE 11 Hour NLP Transformers Course (Next 3 Days Only)

FREE 11 Hour NLP Transformers Course (Next 3 Days Only)

New Features in Python 3.10

New Features in Python 3.10

Training BERT #5 - Training With BertForPretraining

Training BERT #5 - Training With BertForPretraining

How-to Use HuggingFace's Datasets - Transformers From Scratch #1

How-to Use HuggingFace's Datasets - Transformers From Scratch #1

Build a Custom Transformer Tokenizer - Transformers From Scratch #2

Build a Custom Transformer Tokenizer - Transformers From Scratch #2

3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)

3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)

3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)

3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)

Building MLM Training Input Pipeline - Transformers From Scratch #3

Building MLM Training Input Pipeline - Transformers From Scratch #3

Training and Testing an Italian BERT - Transformers From Scratch #4

Training and Testing an Italian BERT - Transformers From Scratch #4

Faiss - Introduction to Similarity Search

Faiss - Introduction to Similarity Search

Angular App Setup With Material - Stoic Q&A #5

Angular App Setup With Material - Stoic Q&A #5

Why are there so many Tokenization methods in HF Transformers?

Why are there so many Tokenization methods in HF Transformers?

Choosing Indexes for Similarity Search (Faiss in Python)

Choosing Indexes for Similarity Search (Faiss in Python)

Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)

Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)

How LSH Random Projection works in search (+Python)

How LSH Random Projection works in search (+Python)

IndexLSH for Fast Similarity Search in Faiss

IndexLSH for Fast Similarity Search in Faiss

Faiss - Vector Compression with PQ and IVFPQ (in Python)

Faiss - Vector Compression with PQ and IVFPQ (in Python)

Product Quantization for Vector Similarity Search (+ Python)

Product Quantization for Vector Similarity Search (+ Python)

How to Build a Bert WordPiece Tokenizer in Python and HuggingFace

How to Build a Bert WordPiece Tokenizer in Python and HuggingFace

Metadata Filtering for Vector Search + Latest Filter Tech

Metadata Filtering for Vector Search + Latest Filter Tech

Build NLP Pipelines with HuggingFace Datasets

Build NLP Pipelines with HuggingFace Datasets

Composite Indexes and the Faiss Index Factory

Composite Indexes and the Faiss Index Factory

This video teaches how to use IndexLSH in Faiss for fast similarity search, including its implementation, configuration, and performance evaluation. It covers the trade-offs between speed and accuracy in IndexLSH and provides practical steps for applying IndexLSH to real-world problems.

Key Takeaways

Initialize IndexLSH with dimensionality and n bits value
Add data to IndexLSH
Search for k nearest neighbors using IndexLSH
Increase n bits value to get a better distribution between buckets
Calculate average similarity using cosine similarity
Repeat process for different n bits values
Print n bits value and corresponding cosine similarity value
Create an IndexLSH with a specified number of bits
Add vectors to the IndexLSH
Extract the binary vectors from the IndexLSH

💡 IndexLSH can be useful for fast similarity search, but it sacrifices accuracy for speed, and high dimensionality is difficult to balance with speed and accuracy.

🔒 Pro feature: Ask AI to explain this lesson →

More on: Vector Stores

View skill →

FULLY LOCAL Mistral AI PDF Processing [Hands-on Tutorial]

FULLY LOCAL Mistral AI PDF Processing [Hands-on Tutorial]

🚀 Deploy a PRIVATE Chroma Vector DB to AWS | Step by step 🚀

🚀 Deploy a PRIVATE Chroma Vector DB to AWS | Step by step 🚀

AI-Powered Resumes with Super People & Weaviate

AI-Powered Resumes with Super People & Weaviate

Weaviate vector database

Build Advanced Retrieval-Augmented Generation (RAG) with MongoDB Vector Search

Build Advanced Retrieval-Augmented Generation (RAG) with MongoDB Vector Search

Creating & Ingesting Your Own Embeddings in Weaviate | Vector Databases for Beginners | Part 7

Creating & Ingesting Your Own Embeddings in Weaviate | Vector Databases for Beginners | Part 7

Data Science Dojo

Configuring Vector Search in AlloyDB

Related Reads

Solving One of the Hardest Problems in Code RAG: Context Retrieval

Learn to solve context retrieval in code RAG systems, a crucial challenge in automation code generation, and improve your skills in RAG and code analysis.

Practical RAG, Part 1: The Simplest RAG That Actually Works

Learn to build a simple Retrieval-Augmented Generation pipeline from scratch in Python and understand its limitations

Dev.to · Suman Nath

What Is Retrieval-Augmented Generation (RAG)? A Complete Guide for Businesses

Learn how Retrieval-Augmented Generation (RAG) helps businesses get accurate answers from AI, moving beyond confident guesses

I Built a Biomedical RAG System, and a 40-Year-Old Algorithm Beat My Vector Database

Learn how to build a biomedical RAG system and evaluate its performance against a 40-year-old algorithm

Dev.to · Oluwagbade Odimayo

RRF vs DBSF with Qdrant: Hybrid Retrieval Fusion for RAG in Python

Professor Py: AI Engineering