How LSH Random Projection works in search (+Python)

James Briggs · Beginner ·🔍 RAG & Vector Search ·4y ago

Key Takeaways

The video demonstrates Locality Sensitive Hashing (LSH) with random projection, a technique used in approximate similarity search, and provides a Python implementation to showcase its efficiency in search.

Full Transcript

okay so we're on to lsh or locality sensitive hashing with random projection in the previous video in the series if you're if you're following it we covered lsh but we covered the more traditional version of it with chingling minhashing and lsh what we're covering here is i suppose more of like a modern implementation of it and this is what you'll see in libraries like fights which we'll work through later on how to actually implement this in device so what we're going to cover in this video is specifically lsh random projection we're going to work through a few visualizations to try and get a grasp of what it is actually doing and whilst we're doing that we'll also work through how we implement that in very simple python code for the vice implementation of this which is obviously much more efficient than what we'll be covering in this video we will cover that in another separate video to this because otherwise it's just a huge a very long video so if you would like to see that i'll make sure there's a link to that in the description so let's just recap very quickly on you know what is lsh so at the top here we have this first kind of row if you like um here we're minimizing collisions so this is a hash function this blue bit in the middle and these are vectors on the left and we're passing through our hash function and it's making sure that we put both of those into separate buckets that's minimizing collisions lsh tries to do the opposite and what it does is maximizes collisions but it only tries to maximize collisions for similar vectors not just everything so lsh in short is a hashing function that tries to book it similar vectors together now when we're performing search with lsh we have all of our i would say database vectors indexed already so they've all been butted imagine they've all been bucketed and we introduced a new vector which would be our query vector and we process that through the same bucketing mechanism and we see where it lands and then essentially what we can say is okay this is landed in this bucket and that means that the most similar other vectors to it are the ones that are either in the same bucket or in the neighboring buckets so what it does is allows us to compress our vectors into lower resolution factors which makes our search a lot faster and that is what you can see here so we have our dense vectors at the top typically we're using dense vectors for this and they can contain hundreds or thousands of values and they're all typically floating point numbers so memory wise it's pretty heavy and what we do is convert them into a very small binary vector which is what you can you can see the bottom so it's a lot more memory efficient and usually it should be faster to search although sometimes it can actually go the other way and it can and it can become slower to search than just comparing and then just using a flat index now at the same time because we are approximating these vectors the search accuracy is obviously going to decrease but what we want to do really is to maintain decent accuracy whilst speeding up the search and our aim is using xq which is our query vector is to return the k nearest neighbors as accurately as possible obviously we're approximating so it won't be perfect but that's fine as long as you get a decent speed up now lsh with random projection works by splitting our vector space which is obviously highly dimensional vector space using hyperplanes so i mean it's what you can you can see right here and the way that it works is that given a a single hyperplane on the positive side of that hyperplane if your if your vector appeared on that side it would be assigned a positive dot product value okay and then we would process that and in our binary vector this would be assigned a one on the other side of that you have the negative side of the hyperplane and if your vector is on that side of it it will be assigned a negative value with the dot product and with that in our binary vector would be assigned a zero and the reason that that works is we're using the dot product value so imagine this green line down here imagine this is the hyperplane that i just showed you we have that normal vector um the n that comes out here and using the dot product if the dot product finds that both of these are in the same direction eg anything in this sort of angle then it will take that as a positive if is on the other side so anywhere here it would take that as a negative dot product which is what you can see over here now a single binary value isn't going to tell us much about the direction of our vector or the position of our vector so what we do is just add more hyperplanes in there so we add more hyperplanes and that gives us more binary values within our vector so what you can see here is we have those we have two hyperplanes now uh the magenta obviously correlates to the zero index in these vectors and then the teal hyperplane correlates to these ones the the number one indexes so what we would do is essentially just use loads of hyperplanes which is kind of what you see here so these are the big arrows and the points that you see there the normal vectors so where we're actually calculating the dot product and then these are our hyperplanes okay so here for blue we would get a value of zero and then here for example for blue we'd get a value of one now let's start building this out in in code so we can actually see how this works so what i'm going to do is set the number of hyperplanes that we would like and we set that using a parameter here called n bits i'm going to say we have four hyperplanes just for this example in reality we'll use more but for now we're going to go with four so we have four binary values here and all i'm going to do is um create our vector dimensionality as well so we're just going to use 2d vectors to make it easier and so we can sort of visualize stuff as well so all we need to do is we're going to create the plane norms and they are going to be numpy random rand and the dimensionality there will be n bits and d so the number of hybrid planes that we want so n bits and the dimensionality that we're actually using and we're just going to minus 0.5 because we want them to center around the zero the origin zero axis you don't need to do this by the way it's it's just so we can kind of see the effect of it a little bit better again we get get these values now those vectors don't align to this visualization but it's essentially the same thing what what we have done is we've created four 2d hyperplanes or we've actually created four uh 2d normal vectors that we're going to use to build our binary vectors so we're going to create these three vectors a b and c and what we're going to do is calculate the dot products for each one of those so then we know whether they are positive or negative behind each of our outplay norms so to do that we're just going to do mp dot and then we just add our vector and we add plane norms and we just transpose those okay so let me let's see what we get there okay so we see that we get negative negative positive positive okay so when we convert this into our binary vector that'll be zero zero one one now we want to do this not just for a but also for b and c so okay now what we want to do is say okay if it's negative it's zero if it's positive it's a one so to do that all we want to do is write a dot and we say well greater than zero it's a it's a one so it's positive and we do that again for each one of our vectors let's see what we get okay so now we get false false true true so the final thing to do there although i don't you don't necessarily need to is just convert them in fact we do need to uh purely to create our the the binary and vector string so we we'll see why in a moment it's fine as type in so it's essentially easier for us to visualize that's it okay and we should get something that looks like this here um so see here obviously the values the the positions are slightly different we have a is on the positive side of the teal high plane so 1 is of course one and then on c and b are both on the other side so they are of course negative now if we consider a b and c b and c kind of in the more in the in a similar position so we would hope that they kind of align in there in the values that they have a bit better than um a but let's have a look see what we get okay so they're the same so that's good because they are very much in a very similar place which you can see from here so a b and c in this case match up to what we're writing in the code it's just a hyperplane inside a different now it's these binary vectors that we use to create our lh lsh buckets so what we're going to do is actually implement that just using a python dictionary to make it easy so what we'll do is i'm going to put each of our vectors a dot and b dot and c dot into this vectored list it's just so we can iterate through them a little bit easier i'm going to initialize our buckets which are going to be like i said a python dictionary and here i'm just going to set i equal to zero so this is just so we can loop through each of those each of those vectors so we're going to do for i in range length of vectors so yeah we don't need either so the first thing we want to do is is create a hash string using the the vectors that we have up here so i'm going to do hash string equals i'm going to do join like that and what we want to do is join all of those numbers together but we need to to join them as strings so when vectors i as type string okay so if i let me show you what that does hash string okay so we just get something like that now then what we want to do is say if the hash string is not already within our buckets it's not in buckets or keys we want to initialize a new list so buckets hash string equals a new list so we're initializing a list essentially initializing a bucket to put our vectors in so initialize it and then after that we we just add we append it to that bucket so this is essentially what we need let me okay yeah should should be fine so let's print the buckets and see what we get if hash string is what did i write there not in there we go so now we see we have these two hash buckets and one and two have both been been put into the same one all right and that's essentially how late what lsh works but just on a much bigger scale so now let's say you know we we have our buckets and what we want to do is give a new vector we want to search using that we want to hash it and search this is our query vector x cube so what we we see here um so we've got like two examples one on the left is an example so this is we're comparing our query vector against two samples in our lsh buckets so x q is this zero it's been zero one one one and what we do is first we compare it to this vector and we see and we're using hamming distance here which is essentially you know do these two equivalent values match or not if they do it's uh zero there's there's no distance between them if they do not match then the distance is one and then we add up um all of the distance values at the end there so with this one none of them match so we get like one plus one plus one plus one which is obviously four so the hamming distance between those two is four which is uh the biggest you're going to get with this dimensionality of binary vectors and then we have the second one and these ones match a bit better so zero is equal to zero one to one one to one so all those zeros and then so like zero zero zero and then this final one is one so then that equals one okay so that's hamming distance and when we consider that with our code over here we we also have to consider that there's a degree of a degree of information being lost because where this is how we're storing the vectors we don't saw the original vectors anymore this is you know that essentially like final form in the lsh index so say we have our query of x and it comes through a zero zero one zero great we get a perfect match but in reality does it is it close to one or is it closer to two we don't we don't actually know so we have to be careful when we're building our buckets to make sure that there's not too many items within each bucket we need them to be reasonably spread out but not spread too thin because if we spread them too thin across too many buckets the the index becomes absolutely huge so it's it's definitely like a balancing act between having enough buckets um enough granularity in there to differentiate between a reasonable number of vectors but not too granular that we just make the index bigger because then it's slower than just doing a flat search now what you what we see here is what we just discussed right so given these two vectors a and b they're reasonably far from each other if we use a value of n bits value of two uh our vectors are not big enough to get butted into the same place we can't differentiate them but what we can do obviously increase the number of hyper planes or increase n bits and then we can differentiate them so so for you know these two here exactly the same uh buckets these are not we have differences here here and here so we can differentiate between them which obviously is what we need but at the same time we are increasing the size of our index so it means we are becoming more accurate but we're also getting slower so it's yeah finding the middle ground between them both now that's it for the implementation details behind vice what we'll do is we're going to leave this video here and we'll cover device implementation in the next video which i will leave a link to in the description so you can find that easily but for now thank you very much for watching and i'll see you in the next one

Original Description

Locality sensitive hashing (LSH) is a widely popular technique used in approximate similarity search. The solution to efficient similarity search is a profitable one - it is at the core of several billion (and even trillion) dollar companies. The problem with similarity search is scale. Many companies deal with millions-to-billions of data points every single day. Given a billion data points, is it feasible to compare all of them with every search? Further, many companies are not performing single searches - Google deals with more than 3.8 million searches every minute. Billions of data points combined with high-frequency searches are problematic - and we haven't considered the dimensionality nor the similarity function itself. Clearly, an exhaustive search across all data points is unrealistic for larger datasets. The solution to searching impossibly huge datasets? Approximate search. Rather than exhaustively comparing every pair, we approximate - restricting the search scope only to high probability matches. 🌲 Pinecone article: https://www.pinecone.io/learn/locality-sensitive-hashing-random-projection/ Download Sift1M: https://gist.github.com/jamescalam/a09a16c17b677f2cf9c019114711f3bf IndexLSH for Fast Similarity Search in Faiss: https://youtu.be/ZLfdQq_u7Eo 🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5 🎉 Sign-up For New Articles Every Week on Medium! https://medium.com/@jamescalam/membership 👾 Discord: https://discord.gg/c5QtDB9RAP 🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff
Watch on YouTube ↗ (saves to browser)
Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from James Briggs · James Briggs · 53 of 60

1 Stoic Philosophy Text Generation with TensorFlow
Stoic Philosophy Text Generation with TensorFlow
James Briggs
2 How to Build TensorFlow Pipelines with tf.data.Dataset
How to Build TensorFlow Pipelines with tf.data.Dataset
James Briggs
3 Every New Feature in Python 3.10.0a2
Every New Feature in Python 3.10.0a2
James Briggs
4 How-to Build a Transformer for Language Classification in TensorFlow
How-to Build a Transformer for Language Classification in TensorFlow
James Briggs
5 How-to use the Kaggle API in Python
How-to use the Kaggle API in Python
James Briggs
6 Language Generation with OpenAI's GPT-2 in Python
Language Generation with OpenAI's GPT-2 in Python
James Briggs
7 Text Summarization with Google AI's T5 in Python
Text Summarization with Google AI's T5 in Python
James Briggs
8 How-to do Sentiment Analysis with Flair in Python
How-to do Sentiment Analysis with Flair in Python
James Briggs
9 Python Environment Setup for Machine Learning
Python Environment Setup for Machine Learning
James Briggs
10 Sequential Model - TensorFlow Essentials #1
Sequential Model - TensorFlow Essentials #1
James Briggs
11 Functional API - TensorFlow Essentials #2
Functional API - TensorFlow Essentials #2
James Briggs
12 Training Parameters - TensorFlow Essentials #3
Training Parameters - TensorFlow Essentials #3
James Briggs
13 Input Data Pipelines - TensorFlow Essentials #4
Input Data Pipelines - TensorFlow Essentials #4
James Briggs
14 6 of Python's Newest and Best Features (3.7-3.9)
6 of Python's Newest and Best Features (3.7-3.9)
James Briggs
15 Novice to Advanced RegEx in Less-than 30 Minutes + Python
Novice to Advanced RegEx in Less-than 30 Minutes + Python
James Briggs
16 Building a PlotLy $GME Chart in Python
Building a PlotLy $GME Chart in Python
James Briggs
17 How-to Use The Reddit API in Python
How-to Use The Reddit API in Python
James Briggs
18 How to Build Custom Q&A Transformer Models in Python
How to Build Custom Q&A Transformer Models in Python
James Briggs
19 How to Build Q&A Models in Python (Transformers)
How to Build Q&A Models in Python (Transformers)
James Briggs
20 How-to Decode Outputs From NLP Models (Python)
How-to Decode Outputs From NLP Models (Python)
James Briggs
21 Identify Stocks on Reddit with SpaCy (NER in Python)
Identify Stocks on Reddit with SpaCy (NER in Python)
James Briggs
22 Sentiment Analysis on ANY Length of Text With Transformers (Python)
Sentiment Analysis on ANY Length of Text With Transformers (Python)
James Briggs
23 Unicode Normalization for NLP in Python
Unicode Normalization for NLP in Python
James Briggs
24 The NEW Match-Case Statement in Python 3.10
The NEW Match-Case Statement in Python 3.10
James Briggs
25 Multi-Class Language Classification With BERT in TensorFlow
Multi-Class Language Classification With BERT in TensorFlow
James Briggs
26 How to Build Python Packages for Pip
How to Build Python Packages for Pip
James Briggs
27 How-to Structure a Q&A ML App
How-to Structure a Q&A ML App
James Briggs
28 How to Index Q&A Data With Haystack and Elasticsearch
How to Index Q&A Data With Haystack and Elasticsearch
James Briggs
29 Q&A Document Retrieval With DPR
Q&A Document Retrieval With DPR
James Briggs
30 How to Use Type Annotations in Python
How to Use Type Annotations in Python
James Briggs
31 Extractive Q&A With Haystack and FastAPI in Python
Extractive Q&A With Haystack and FastAPI in Python
James Briggs
32 Sentence Similarity With Sentence-Transformers in Python
Sentence Similarity With Sentence-Transformers in Python
James Briggs
33 Sentence Similarity With Transformers and PyTorch (Python)
Sentence Similarity With Transformers and PyTorch (Python)
James Briggs
34 NER With Transformers and spaCy (Python)
NER With Transformers and spaCy (Python)
James Briggs
35 Training BERT #1 - Masked-Language Modeling (MLM)
Training BERT #1 - Masked-Language Modeling (MLM)
James Briggs
36 Training BERT #2 - Train With Masked-Language Modeling (MLM)
Training BERT #2 - Train With Masked-Language Modeling (MLM)
James Briggs
37 Training BERT #3 - Next Sentence Prediction (NSP)
Training BERT #3 - Next Sentence Prediction (NSP)
James Briggs
38 Training BERT #4 - Train With Next Sentence Prediction (NSP)
Training BERT #4 - Train With Next Sentence Prediction (NSP)
James Briggs
39 FREE 11 Hour NLP Transformers Course (Next 3 Days Only)
FREE 11 Hour NLP Transformers Course (Next 3 Days Only)
James Briggs
40 New Features in Python 3.10
New Features in Python 3.10
James Briggs
41 Training BERT #5 - Training With BertForPretraining
Training BERT #5 - Training With BertForPretraining
James Briggs
42 How-to Use HuggingFace's Datasets - Transformers From Scratch #1
How-to Use HuggingFace's Datasets - Transformers From Scratch #1
James Briggs
43 Build a Custom Transformer Tokenizer - Transformers From Scratch #2
Build a Custom Transformer Tokenizer - Transformers From Scratch #2
James Briggs
44 3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)
3 Traditional Methods for Similarity Search (Jaccard, w-shingling, Levenshtein)
James Briggs
45 3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)
3 Vector-based Methods for Similarity Search (TF-IDF, BM25, SBERT)
James Briggs
46 Building MLM Training Input Pipeline - Transformers From Scratch #3
Building MLM Training Input Pipeline - Transformers From Scratch #3
James Briggs
47 Training and Testing an Italian BERT - Transformers From Scratch #4
Training and Testing an Italian BERT - Transformers From Scratch #4
James Briggs
48 Faiss - Introduction to Similarity Search
Faiss - Introduction to Similarity Search
James Briggs
49 Angular App Setup With Material - Stoic Q&A #5
Angular App Setup With Material - Stoic Q&A #5
James Briggs
50 Why are there so many Tokenization methods in HF Transformers?
Why are there so many Tokenization methods in HF Transformers?
James Briggs
51 Choosing Indexes for Similarity Search (Faiss in Python)
Choosing Indexes for Similarity Search (Faiss in Python)
James Briggs
52 Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)
Locality Sensitive Hashing (LSH) for Search with Shingling + MinHashing (Python)
James Briggs
How LSH Random Projection works in search (+Python)
How LSH Random Projection works in search (+Python)
James Briggs
54 IndexLSH for Fast Similarity Search in Faiss
IndexLSH for Fast Similarity Search in Faiss
James Briggs
55 Faiss - Vector Compression with PQ and IVFPQ (in Python)
Faiss - Vector Compression with PQ and IVFPQ (in Python)
James Briggs
56 Product Quantization for Vector Similarity Search (+ Python)
Product Quantization for Vector Similarity Search (+ Python)
James Briggs
57 How to Build a Bert WordPiece Tokenizer in Python and HuggingFace
How to Build a Bert WordPiece Tokenizer in Python and HuggingFace
James Briggs
58 Metadata Filtering for Vector Search + Latest Filter Tech
Metadata Filtering for Vector Search + Latest Filter Tech
James Briggs
59 Build NLP Pipelines with HuggingFace Datasets
Build NLP Pipelines with HuggingFace Datasets
James Briggs
60 Composite Indexes and the Faiss Index Factory
Composite Indexes and the Faiss Index Factory
James Briggs

This video teaches the basics of Locality Sensitive Hashing (LSH) with random projection and its application in efficient similarity search. It provides a Python implementation to demonstrate the technique's efficiency in search. By watching this video, viewers can learn how to implement LSH Random Projection for efficient search and evaluate the trade-off between accuracy and speed.

Key Takeaways
  1. Set the number of hyperplanes using the parameter n bits
  2. Create the hyperplane normals using numpy's random.rand function
  3. Calculate the dot product between the vector and each hyperplane normal
  4. Convert the dot product values to 1 or 0 based on their sign
  5. Create LSH buckets by hashing the binary vectors and grouping them together based on their hash values
  6. Search for a query vector by comparing it to the vectors in the LSH buckets using Hamming distance
💡 LSH Random Projection is a technique that balances the trade-off between accuracy and speed in similarity search by using dimensionality reduction and hashing to group similar vectors together.

Related AI Lessons

Your AI Keeps Making Things Up. RAG Is How You Make It Use Real Facts Instead.
Learn how to use RAG to make your AI provide accurate answers based on real facts instead of making things up
Medium · RAG
Evaluation Metrics for RAG: Measure Retrieval, Generation, and End-to-End Quality With Numbers That…
Learn to evaluate RAG models using metrics that measure retrieval, generation, and end-to-end quality
Medium · AI
Evaluation Metrics for RAG: Measure Retrieval, Generation, and End-to-End Quality With Numbers That…
Learn to evaluate RAG models using metrics that measure retrieval, generation, and end-to-end quality
Medium · Data Science
When Does HyDE Help RAG? I Tested 3 Query Types and It Failed on Two
Learn when HyDE retrieval helps or hinders RAG performance across different query types, and why it matters for improving search accuracy
Medium · AI
Up next
RRF vs DBSF with Qdrant: Hybrid Retrieval Fusion for RAG in Python
Professor Py: AI Engineering
Watch →