How to Index Q&A Data With Haystack and Elasticsearch
Key Takeaways
This video demonstrates how to index Q&A data with Haystack and Elasticsearch: installing Elasticsearch, checking that the cluster is running, creating a new index, and writing paragraphs from a text file into that index through Haystack.
Full Transcript
Okay, so in this video what we're going to do is actually index our data. At the moment we just have all of our paragraphs from Meditations by Marcus Aurelius, and to index them we are going to be using the Elasticsearch document store. Of course, if we're using Elasticsearch we first need to download and install it, so I'm just going to take you through those steps now.

All we need to do is head on over to this website up here, elastic.co, and you can see the address just there. I'm going to follow the instructions for Windows, but if you're on Linux or Mac just follow along, it's very similar either way. Here we're going to install it on Windows using the MSI installer, so just scroll down and we can see we can download the package from this link. Download that, and once you've downloaded it just open it and we'll see this window pop up. Once you see this window, we just go through with all of the default settings, so install as a service and continue through. Obviously if you do need to change anything, change it, but for me there's nothing here that I want to modify. Notice here we have the HTTP port, and we're using 9200; we'll be using that later. We just continue through with the default settings, then we click install and let that install.

Okay, so now that we've installed Elasticsearch, we can go ahead and check that it's running. To do that we're going to import Python requests. Whenever we interact with Elasticsearch it's either going to be through Haystack or through the requests library, where we just interact with the Elasticsearch API. To check the health of our cluster, so essentially to check that it's actually up and running, all we need to do is send a GET request to localhost, and if you remember, earlier we had port 9200. Of course, if the port on yours was different, modify it; this is just the default value. After this we need to reach out to the _cluster endpoint, and then we are checking the health, and we'll just format the response as JSON.

What you should see here is our cluster name, which is elasticsearch. It may have a different name if you modified it, but by default it's elasticsearch. The status is yellow, which here basically just means we only have one node up and running. You can have multiple nodes in Elasticsearch, and for your cluster health to be green it will expect the shards of your indices to have backup shards on different nodes. Obviously we can't do that if we only have one node, but it's completely fine for us because we're just in development. If you're in production then yes, you'd probably want those backup shards. If none of that made any sense, don't worry about it; we really don't need to know any of that for what we're doing here.

What we can also do is check whether we have any indices already. If I take a look at mine, I will already have some indices, which I set up prior to recording this. To check, we go to localhost again, and this time we want to call the _cat API, which is what we call whenever we want to see data in a human-readable table format rather than JSON. What we're checking here are the indices, and we'll just add .text onto the end so we can actually see the response. This is quite messy, so if we print it instead it looks a bit cleaner. Okay, so you can see I have these two indices. You won't have either of those, so don't worry about that.

What we are going to do now is create a new index, which will be called aurelius, and that is where we will put our documents. To actually implement that we will be going through the Haystack library, which you can install with pip install farm-haystack. What we want is: from haystack.document_store.elasticsearch import ElasticsearchDocumentStore. So this is our document store, and of course it is not yet aware of our Elasticsearch instance; we need to initialize it. We'll store it in a variable called doc_store, and all we write is ElasticsearchDocumentStore. Now we need to initialize it with the parameters so it knows where to connect to our Elasticsearch instance. To do that we write host, and this is localhost. If you have a username and password set, which you don't by default, you will need to enter them in here. I don't have any set, so no worries. Then we also need to specify our index. At the moment we don't have an aurelius index, and that's fine because this will initialize it for us, so we'll just call it aurelius. If we go down here we can see what it actually did: it sent a PUT request to localhost:9200/aurelius. So that's how you create a new index.

After that, what we want to do is first import our data. We have the data here, which I got from this website and processed with this script, which you can find on GitHub. I'll keep a link in the description so you can just go and copy that if you need to. I haven't really done much preprocessing, it's pretty straightforward. All you need to do here is open that data, so we do that with open. That data file is located two folders up in a data folder, and it's called meditations.txt. I'm going to be reading it, and all we do is data = f.read(). Then if we have a quick look at the first 100 characters, we see that we have this newline character, and that signifies a new paragraph from the text. So what we want to do here is split the data by newline, and then if we check the length of that, you see that we have 508 separate paragraphs in there.

What we now want to do is modify this data so that it's in the correct format for Haystack and Elasticsearch. It expects a list of dictionaries, where each dictionary looks like this: a text field, and inside it we would have our paragraph, so each one of these items here. Then there's another optional field called meta. Meta contains a dictionary, and in here we can put whatever we want. For us, I don't think at the moment there's really that much to put in here other than where the paragraph came from, so the book, or maybe source is a better word to use here. All of these are coming from Meditations. Later on we will probably add a few other books as well, and then the source will be different, and when we return that item from our retriever and our reader we will at least be able to see which book it came from. It would also be pretty cool to include something like a page number, but with this data there are no page numbers included, so we're not doing that at the moment.

So that's the format that we need, and it's going to be a list of these. To do that we'll just use a list comprehension. We'll copy this and indent it, and in here we have our paragraph, and the source is meditations for all of them, and then we just write for paragraph in data. Okay, so that should work. If we check what we have here, that's what we want: we have text, which holds a paragraph, and then we have this meta with a source, which is always meditations at the moment. That looks pretty good, and we'll just double check the length again; it should be 508. Okay, perfect.

Now what we need to do is index all of these documents into our Elasticsearch instance, and that is super easy. All we do is call doc_store, because we're doing this through Haystack now, and we do write_documents and just pass in our data_json. Okay, cool. We can see here what it's done: it sent a POST request to the _bulk API, and it sent two of them, I assume because it can only send so many documents at once. So that's pretty cool.

Now what I want to check is that we actually have 508 documents in our Elasticsearch instance. To do that we're going to revert back to requests, so we'll do requests.get again, go to our localhost on 9200, and here we need to specify the index that we want to count the number of entries in, and then all we do is add _count onto the end. This will return a JSON object, so we format it as JSON so that we can see it, and sure enough we have 508 items in that document store.

So if we head on back to our original plan: up here we had Meditations, and we've now got that, and we've also set up the first part of our stack over here, so Elasticsearch now has Meditations in there and we can cross that off. The next step is setting up our retriever, which we'll cover in the next video. So that's everything for this video. I hope you enjoyed it, and I will see you again in the next one.
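The steps described in the transcript can be sketched roughly as follows. This is a minimal sketch rather than the code shown in the video. It assumes an older farm-haystack release where the import path is haystack.document_store.elasticsearch and documents use a "text" key, and an Elasticsearch node on localhost:9200 with no authentication. The function names to_haystack_docs and index_into_elasticsearch, the blank-line filter, and the default file path are illustrative assumptions.

```python
def to_haystack_docs(raw_text, source="meditations"):
    """Split raw text on newlines and wrap each paragraph in the
    {"text": ..., "meta": {...}} dictionary format described above."""
    # Skipping blank lines is an addition for safety; the transcript
    # only splits on the newline character.
    paragraphs = [p for p in raw_text.split("\n") if p.strip()]
    return [
        {"text": paragraph, "meta": {"source": source}}
        for paragraph in paragraphs
    ]


def index_into_elasticsearch(docs, host="localhost", port=9200, index="aurelius"):
    """Check the cluster, write docs through Haystack, and return the
    document count reported by Elasticsearch for the index."""
    # Imported lazily so the formatting helper above works without
    # these packages installed.
    import requests
    from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

    base = f"http://{host}:{port}"

    # Cluster health: a single-node development setup typically reports "yellow".
    health = requests.get(f"{base}/_cluster/health").json()
    print("cluster status:", health["status"])

    # Human-readable table of the existing indices.
    print(requests.get(f"{base}/_cat/indices").text)

    # Initializing the store creates the index if it does not exist yet.
    doc_store = ElasticsearchDocumentStore(
        host=host, port=port, username="", password="", index=index
    )
    doc_store.write_documents(docs)

    # Ask Elasticsearch how many documents the index now holds.
    return requests.get(f"{base}/{index}/_count").json()["count"]


def load_and_index(path="../../data/meditations.txt"):
    """Read the text file, format it, and index it (path is an assumption)."""
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()
    return index_into_elasticsearch(to_haystack_docs(data))
```

With Elasticsearch running, calling load_and_index() should return the number of indexed documents, which the transcript reports as 508 for this data. If the count looks low immediately after writing, the index may not have refreshed yet. Newer Haystack versions use a different import path and document key, so treat the names here as version-dependent.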
Original Description
▶️ Stoic Q&A App Playlist: https://www.youtube.com/playlist?list=PLIUOU7oqGTLixb-CatMxNCO-mJioMmZEB
The second video in 'Building a Stoic Q&A App' - here we're setting up Elasticsearch and Haystack to store the data (Meditations) ready for retrieval when we ask our app questions.
Find the code here:
https://github.com/jamescalam/aurelius/tree/main/code/labs
🤖 70% Discount on the NLP With Transformers in Python course:
https://bit.ly/3DFvvY5
Playlist
Uploads from James Briggs · James Briggs · 28 of 60