IndexLSH for Fast Similarity Search in Faiss

James Briggs · Beginner ·🔍 RAG & Vector Search ·4y ago

Key Takeaways

The video demonstrates the use of IndexLSH in Faiss for fast similarity search, showcasing its implementation, configuration, and performance evaluation. It covers the initialization of IndexLSH, searching for k nearest neighbors, and visualizing the distribution of vectors in each bucket.

Full Transcript

hi and welcome to the video we're going to be covering how we can implement lsh in physe now we covered lsh and how it works using random projection in a previous video which i'm going to release at the same time as this video so i'll link to that in the description if you want to understand how lsh works but in this video we're just going to cover how we implement it in vice so we're going to be using the sift 1m data set which you can download using i'll put a link to the script in the description but it's just a big data set of dense vectors now from this we'll get our 1 million samples xp and we also have our queries which i'm going to just we're just going to use one query at a time okay so we'll have one query one million samples and we perform our search using that so what we first want to do is import files i'm going to get my dimensionality which is is it xb dot shape and 1 i think d yep 128 so that's our dimensionality of each vector within the sift 1m data set and initially let's use an n bits value of 4 which gives us if we we think about it how many how many buckets does that give us we can do this right so it gives us a total of 16 buckets okay these are also potential buckets our vectors aren't necessarily going to be separated between all of them they might we might end up only using like 10 of those buckets for example okay and then what we want to do is initialize our index so first index lsh and in here you need to pass our dimensionality and also our n bits value so how many binary values we're going to include within each bucketed item so do we first thing it's always good to check do we need to train the index indexes train is true so that means we don't need to train it there's no optimizations or anything we need to do here so that's good don't need to worry about that and then so without needing training all we do is index add and we add our data okay and now we're actually just ready to go ahead and search so we'll do d i equals index.search xq and let's return let's set k up here actually so let's set k equal to 10 okay so do k [Music] and then we get the the these are the apparently the most similar indexes to our query okay and there's one good thing to check here which makes it really easy to figure out if we have enough buckets or not is d which is the distance between those vectors now obviously if we have all of our vectors in a single bucket and we we search and our query ends up in the same bucket all of the distances here are going to be zero because they're all going to be exactly the same uh hamming distance which is what we're returning so let's look and that's that's what we get since of i mean we have very few uh numbers here so that's not that surprising let's push k up to 10k you'd think at this point you know surely would get someone's i don't know so at least the first 10 000 nearest neighbors are all within one bucket and so it clearly and there's an issue here we can't differentiate between all of those but if we if we just increase this to a very high number let's say like 500 000 so this is returning half of the index at this point we should i imagine see we see a few ones so at some point within the first 5 000 items we do actually um we we move from a single bucket to at least one other bucket now if we think about it we can we can actually estimate roughly how what n bits value we need in order to kind of get a better distribution uh between our buckets so to not have everything in in a single bucket so let's try these we have uh for n bits in in these values let's give it a go see what we get um and we see okay embassy equals two so this gives them gives us an average of 250 000 um items within each bucket so obviously that's not useful uh here we get 62k again not used to 4k 15 okay that's a lot better right but it's still not useful because we kind of want you know if we if we set k equal to 10 um this isn't really going to give us anything particularly useful so and and we also need to consider the fact that we in our buckets not all of them are going to get used and what we will find is that it's almost like there's there will be a majority of or there will be a minority of buckets that end up with the majority of vectors so we we need to go sub one uh for this to to work so i i would say something probably n bits of 24 we're getting that i mean that value actually looks pretty good 0.05 so on average we should get 0.05 items within each bucket okay so i mean let's let's test that so we're going to copy this and we'll just write a similar code to what we wrote before although one thing before we do that let's just check uh the average similarity let's say should we change k to like 100 or 10 let's let's go with 10 yeah let's do 10. and what i want to do is given that i want to calculate the the cosine similarity average cosine similarity so we'll do that we use from sk learn metrics pairwise import cosine similarity this is just super quick way for us to calculate and then check um s k learn and we'll do cosine similarity and we have i zero xq sure oh so sorry we need to uh so we need to get there the indexes so i i just contains the indexes we need to we need to pull them out of xb okay and we have this so we also need to take the mean of that so cos mean okay 0.4 which is just the average of those the average similarity of those so it's pretty low let's compare that to if we did that for the whole um the whole data set so like this see what we get okay so actually these are yeah i mean these are slightly more similar um so we even you know get slightly better results even even with that terrible sort of bucketing and so what i wanted to do is is repeat that process we just did there but i want to do it for different end bits values so we need to take our index bring it down here and we're going to say this we do index.add our data and then we do d i equals index search as we as we usually do uh k is 10 and then we want to do cos equals cosine similarity again we're going to do x b i so we're pulling out those indexes that we've got um we're doing this different n bits values hence the hence loop and hq okay and then what we're going to do is just print n bits is equal to whatever it's equal to at that point and then new line i'm going to put um what is similarity so the similarity is equal to cos dot mean like that let's just see what we what we get so okay cool so so we look here and so n bits of two is is terrible gets better it gets better and then at 24 so that bit where we have the 0.05 we get the maximum value so we get a 7 3 here 7 4 which is the best we get and then it kind of comes off a little bit here and i've kind of visualized that here so this is the same um so sure i've gone a little bit high with with n bits as well and this is essentially the sort of structure we get so the bit that we just kind of saw was this section here where we're increasing n bits and reducing the number of or reducing like heavily reducing the number of vectors within each bucket on average and the coastline similarity just shoots up really quick but then afterwards we still get this kind of slow increase and that's because so the buckets are now quite spread like the the vectors are spread very nicely between different buckets but as we increase the resolution even further and further all it means is that we're adding more of the original information the original positional information of those vectors and essentially improving the resolution of those vectors so they're just getting slightly more accurate in the whether the hash buckets are closer to other very similar or not similar vectors now essentially the binary vectors are more heavily representing uh the original dense vectors and that's why we see that slight increase that continues even after we get to to ambits of 32. now one final thing i just wanted to quickly cover um is say we do have issues and and we want to sort of visualize the actual buckets themselves you know how how can we how can we extract those buckets and and view and view the number of vectors that we have in each one now there isn't any any way of doing this in in um in face so you we kind of have to put our own code together so what we need to do is i'm going to create first create another index because i want to visualize these so i'm going to do and if i visualize it with 10 bits of 32 there's going to be well how many how many buckets would we have there it would be 2 to the power of 32 so yeah we can't visualize that many buckets so we're going to go with 4 and then bits of 4 which gives us 16 buckets so index lsh we have d and we're going to use a n bits value of four n bits equals four just put that there okay we're going to add our vectors and now what we want to do is actually extract those binary vectors themselves so to do that we write array so we're going to sort them in in the array variable and we do fice dot vector to array like this and in here we do index dot codes so codes um just that word there is essentially that means you know the binary vectors that we've built they are called codes that's all it means but and we use it this word a lot in in similarity search so it's worth remembering also n bits is used a lot as well so in future videos we will uh in this series we you will see them uh quite a lot so then let's have a look at what we have we have okay so we wanted to pull out the binary codes and we're getting these uh these numbers which is not exactly what we expect or or is it right um well okay let's have a look at the min value here zero okay array max okay so we get 0 to 15 which means we get 16 values and okay let's do n to the power of n bits okay so that means we have 16 unique buckets that all of our vectors are spread between so that's interesting let's see how many vectors we have we have one million so we have one million spread between 16 buckets so actually that sounds about right but how do we so these these values here these are just the integer versions of our binary vectors so um well like this is represented by zero and [Music] this is represented by one right so what we can do is use this little little script here to convert them into the actual binary vectors so here this this value is our first one which is a five and the next one is a 12 we should see here so i mean that's that is our there are binary vectors and we can we can visualize how they are are spread um how many items are within each one of those buckets using this array and what we'll get is something that looks like this um so this is what i mean where we have so we have 16 buckets which you can see on the left here so all of these with 16 of those and not all of them even have any any vectors in so we have so we have one two three four that don't even contain any vectors and then the majority the vast majority are contained within uh these and these right so so these four buckets contain almost everything and they have loads they have uh so the count is to the thousands there so here that number add a few more zeros on to the end um so yeah the these four buckets are far too many far too many vectors in for it to be useful so that's why we before we increase it uh increase our in bits value up to 24. now i think i mean that's basically everything i wanted to cover but just be quickly in terms of performance um so visualize these using this is using device l h s l s h index um so we have n bits so obviously we just increase n bits loads uh we increase the recoil performance so that's one thing about lsh is that to get good performance you really have to increase the end bits value a lot and at that point it actually gets it can get really inefficient um so this is a the time for each query as compared to a flight index so obviously i mean you you kind of want to be careful with a lsh index obviously it can be useful um but you know high dimensionality is difficult to balance well with with speed and accuracy but obviously you know it can be done it just depends on how much accuracy you're willing to sacrifice because in most cases with lsh you will be sacrificing a reasonable amount of accuracy so the recoil here is about 50 and so that's uh it's it's managing to to identify 50 around here of the same items that a flat index would would identify and it's only slightly faster so yeah you just have to weigh you know what is most important for your use case with that but anyway that's it for lsh um this book this is the last video we will cover for lsh for now in the next few videos and articles we're going to be covering a few different indexes and and how we use them effectively so it should be pretty interesting but we're definitely getting a bit more advanced than lsh now so it should be good but that's it for this video thank you very much for watching and i'll see you in the next one

Original Description

Faiss  -  or Facebook AI Similarity Search  -  is an open-source framework built for enabling similarity search. Faiss has many super-efficient implementations of different indexes that we can use in similarity search. That long list of indexes includes IndexLSH - an easy-to-use implementation of everything we have covered so far in LSH. 🌲 Pinecone article: https://www.pinecone.io/learn/locality-sensitive-hashing-random-projection/ Download Sift1M: https://gist.github.com/jamescalam/a09a16c17b677f2cf9c019114711f3bf How LSH Random Projection works in search (+Python): https://youtu.be/8bOrMqEdfiQ 🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5 🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership 👾 Discord: https://discord.gg/c5QtDB9RAP 🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from James Briggs · James Briggs · 54 of 60

1 Stoic Philosophy Text Generation with TensorFlow
Stoic Philosophy Text Generation with TensorFlow
James Briggs
2 How to Build TensorFlow Pipelines with tf.data.Dataset
How to Build TensorFlow Pipelines with tf.data.Dataset
James Briggs
3 Every New Feature in Python 3.10.0a2
Every New Feature in Python 3.10.0a2
James Briggs
4 How-to Build a Transformer for Language Classification in TensorFlow
How-to Build a Transformer for Language Classification in TensorFlow
James Briggs
5 How-to use the Kaggle API in Python
How-to use the Kaggle API in Python
James Briggs
6 Language Generation with OpenAI's GPT-2 in Python
Language Generation with OpenAI's GPT-2 in Python
James Briggs
7 Text Summarization with Google AI's T5 in Python
Text Summarization with Google AI's T5 in Python
James Briggs
8 How-to do Sentiment Analysis with Flair in Python
How-to do Sentiment Analysis with Flair in Python
James Briggs
9 Python Environment Setup for Machine Learning
Python Environment Setup for Machine Learning
James Briggs
10 Sequential Model - TensorFlow Essentials #1
Sequential Model - TensorFlow Essentials #1
James Briggs
11 Functional API - TensorFlow Essentials #2
Functional API - TensorFlow Essentials #2
James Briggs
12 Training Parameters - TensorFlow Essentials #3
Training Parameters - TensorFlow Essentials #3
James Briggs
13 Input Data Pipelines - TensorFlow Essentials #4
Input Data Pipelines - TensorFlow Essentials #4
James Briggs
14 6 of Python's Newest and Best Features (3.7-3.9)
6 of Python's Newest and Best Features (3.7-3.9)
James Briggs
15 Novice to Advanced RegEx in Less-than 30 Minutes + Python
Novice to Advanced RegEx in Less-than 30 Minutes + Python
James Briggs
16 Building a PlotLy $GME Chart in Python
Building a PlotLy $GME Chart in Python
James Briggs
17 How-to Use The Reddit API in Python
How-to Use The Reddit API in Python
James Briggs
18 How to Build Custom Q&A Transformer Models in Python
How to Build Custom Q&A Transformer Models in Python
James Briggs
19 How to Build Q&A Models in Python (Transformers)
How to Build Q&A Models in Python (Transformers)
James Briggs
20 How-to Decode Outputs From NLP Models (Python)
How-to Decode Outputs From NLP Models (Python)
James Briggs
21 Identify Stocks on Reddit with SpaCy (NER in Python)
Identify Stocks on Reddit with SpaCy (NER in Python)
James Briggs
22 Sentiment Analysis on ANY Length of Text With Transformers (Python)
Sentiment Analysis on ANY Length of Text With Transformers (Python)
James Briggs
23 Unicode Normalization for NLP in Python
Unicode Normalization for NLP in Python
James Briggs
24 The NEW Match-Case Statement in Python 3.10
The NEW Match-Case Statement in Python 3.10
James Briggs
25 Multi-Class Language Classification With BERT in TensorFlow
Multi-Class Language Classification With BERT in TensorFlow
James Briggs
26 How to Build Python Packages for Pip
How to Build Python Packages for Pip
James Briggs
27 How-to Structure a Q&A ML App
How-to Structure a Q&A ML App
James Briggs
28 How to Index Q&A Data With Haystack and Elasticsearch
How to Index Q&A Data With Haystack and Elasticsearch
James Briggs
29 Q&A Document Retrieval With DPR
Q&A Document Retrieval With DPR
James Briggs
30 How to Use Type Annotations in Python
How to Use Type Annotations in Python
James Briggs
31 Extractive Q&A With Haystack and FastAPI in Python
Extractive Q&A With Haystack and FastAPI in Python
James Briggs
32 Sentence Similarity With Sentence-Transformers in Python
Sentence Similarity With Sentence-Transformers in Python
James Briggs
33 Sentence Similarity With Transformers and PyTorch (Python)
Sentence Similarity With Transformers and PyTorch (Python)
James Briggs
34 NER With Transformers and spaCy (Python)
NER With Transformers and spaCy (Python)
James Briggs
35 Training BERT #1 - Masked-Language Modeling (MLM)
Training BERT #1 - Masked-Language Modeling (MLM)
James Briggs
36 Training BERT #2 - Train With Masked-Language Modeling (MLM)
Training BERT #2 - Train With Masked-Language Modeling (MLM)
James Briggs
37 Training BERT #3 - Next Sentence Prediction (NSP)
Training BERT #3 - Next Sentence Prediction (NSP)
James Briggs
38 Training BERT #4 - Train With Next Sentence Prediction (NSP)
Training BERT #4 - Train With Next Sentence Prediction (NSP)
James Briggs
39 FREE 11 Hour NLP Transformers Course (Next 3 Days Only)
FREE 11 Hour NLP Transformers Course (Next 3 Days Only)
James Briggs
40 New Features in Python 3.10
New Features in Python 3.10
James Briggs
41 Training BERT #5 - Training With BertForPretraining
Training BERT #5 - Training With BertForPretraining
James Briggs
42 How-to Use HuggingFace's Datasets - Transformers From Scratch #1
How-to Use HuggingFace's Datasets - Transformers From Scratch #1
James Briggs
43 Build a Custom Transformer Tokenizer - Transformers From Scratch #2
Build a Custom Transformer Tokenizer - Transformers From Scratch #2
James Briggs
44 3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)
3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)
James Briggs
45 3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)
3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)
James Briggs
46 Building MLM Training Input Pipeline - Transformers From Scratch #3
Building MLM Training Input Pipeline - Transformers From Scratch #3
James Briggs
47 Training and Testing an Italian BERT - Transformers From Scratch #4
Training and Testing an Italian BERT - Transformers From Scratch #4
James Briggs
48 Faiss - Introduction to Similarity Search
Faiss - Introduction to Similarity Search
James Briggs
49 Angular App Setup With Material - Stoic Q&A #5
Angular App Setup With Material - Stoic Q&A #5
James Briggs
50 Why are there so many Tokenization methods in HF Transformers?
Why are there so many Tokenization methods in HF Transformers?
James Briggs
51 Choosing Indexes for Similarity Search (Faiss in Python)
Choosing Indexes for Similarity Search (Faiss in Python)
James Briggs
52 Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)
Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)
James Briggs
53 How LSH Random Projection works in search (+Python)
How LSH Random Projection works in search (+Python)
James Briggs
IndexLSH for Fast Similarity Search in Faiss
IndexLSH for Fast Similarity Search in Faiss
James Briggs
55 Faiss - Vector Compression with PQ and IVFPQ (in Python)
Faiss - Vector Compression with PQ and IVFPQ (in Python)
James Briggs
56 Product Quantization for Vector Similarity Search (+ Python)
Product Quantization for Vector Similarity Search (+ Python)
James Briggs
57 How to Build a Bert WordPiece Tokenizer in Python and HuggingFace
How to Build a Bert WordPiece Tokenizer in Python and HuggingFace
James Briggs
58 Metadata Filtering for Vector Search + Latest Filter Tech
Metadata Filtering for Vector Search + Latest Filter Tech
James Briggs
59 Build NLP Pipelines with HuggingFace Datasets
Build NLP Pipelines with HuggingFace Datasets
James Briggs
60 Composite Indexes and the Faiss Index Factory
Composite Indexes and the Faiss Index Factory
James Briggs

This video teaches how to use IndexLSH in Faiss for fast similarity search, including its implementation, configuration, and performance evaluation. It covers the trade-offs between speed and accuracy in IndexLSH and provides practical steps for applying IndexLSH to real-world problems.

Key Takeaways
  1. Initialize IndexLSH with dimensionality and n bits value
  2. Add data to IndexLSH
  3. Search for k nearest neighbors using IndexLSH
  4. Increase n bits value to get a better distribution between buckets
  5. Calculate average similarity using cosine similarity
  6. Repeat process for different n bits values
  7. Print n bits value and corresponding cosine similarity value
  8. Create an IndexLSH with a specified number of bits
  9. Add vectors to the IndexLSH
  10. Extract the binary vectors from the IndexLSH
💡 IndexLSH can be useful for fast similarity search, but it sacrifices accuracy for speed, and high dimensionality is difficult to balance with speed and accuracy.

Related Reads

📰
When “Smart” Parsers Fail: Building a Hallucination-Resistant RAG System for the Constitution of…
Learn how to build a hallucination-resistant RAG system to improve AI performance on specific tasks, and why deterministic engineering can be a better approach in certain cases
Medium · Python
📰
Semantic Observability: Engineering Reliability for Production RAG
Learn to engineer reliability for production RAG using semantic observability to identify and fix microservice failures quickly and efficiently
Dev.to · Dumebi Okolo
📰
Stale RAG vs. expensive RAG: how to cache RAG context without serving outdated answers
Learn how to cache RAG context without serving outdated answers to improve the efficiency and accuracy of your Retrieval-Augmented Generation (RAG) system
Dev.to · Vectorlink Labs
📰
Your RAG System Is Lying to You in Production
Learn to identify and fix 14 silent killers in RAG systems that can destroy user trust in production environments, and why it matters for maintaining user trust and system reliability
Medium · RAG
Up next
RRF vs DBSF with Qdrant: Hybrid Retrieval Fusion for RAG in Python
Professor Py: AI Engineering
Watch →