How to Index Q&A Data With Haystack and Elasticsearch

James Briggs · Intermediate · 🧠 Large Language Models · 5y ago

Key Takeaways

This video demonstrates how to index Q&A data with Haystack and Elasticsearch: installing Elasticsearch, creating a new index, and indexing paragraphs from a text file into that index.

Full Transcript

Okay, so in this video what we're going to do is actually index our data. At the moment we just have all of our paragraphs from Meditations by Marcus Aurelius, and to do this we are going to be using the Elasticsearch document store. Of course, if we're using Elasticsearch we first need to download and install it, so I'm just going to take you through those steps now.

All we need to do is head on over to this website up here, elastic.co, and you can see the address just there. I'm going to follow the instructions for Windows, but if you're on Linux or Mac just follow along; it's very similar either way. Here we're going to install it on Windows using the MSI installer, so just scroll down and we can see we can download the package from this link. Download that, and once it has downloaded just open it and we'll see this window pop up. Once you see this window, we just go through with all of the default settings, so install as a service and continue through. Obviously, if you do need to change anything, change it, but for me there's nothing here that I want to modify. Notice here we have the HTTP port, and we're using 9200; we'll be using that later. We just continue through with the default settings, then we click install and let that install.

Okay, so now that we've installed Elasticsearch, we can go ahead and check that it's running. To do that we're going to import Python `requests`. Whenever we interact with Elasticsearch, it's either going to be through Haystack or through the `requests` library, where we just interact with the Elasticsearch API. To check the health of our cluster, so essentially to check that it's actually up and running, all we need to do is send a GET request to localhost, and if you remember, earlier it was port 9200. Of course, if the port on yours was different, modify it; this is just the default value. After this we need to reach out to the `_cluster` endpoint, then we are checking the `health`, and then we'll just format that as JSON.

What you should see here is our cluster name, which is elasticsearch. Yours may have a different name if you modified it, but by default it's elasticsearch. The status is yellow, which basically just means we have one node up and running. You can have multiple nodes in Elasticsearch, and for your cluster health to be green it will expect the shards of your indices to have backup shards across different nodes. Obviously we can't do that if we only have one node, but that's completely fine for us because we're just in development. If you're in production, yes, you'd probably want to have those backup shards. If none of that made any sense, don't worry about it; we really don't need to know any of that for what we're doing here.

What we can also do is check if we have any indices already. If I take a look at mine, I will already have some indices, which I set up prior to recording this. To check that, we go to localhost again, and this time we want to call the `_cat` API, which is what we call whenever we want to see data in a human-readable table format rather than JSON. What we're checking here are the indices, and we'll just add `.text` onto there so we can actually see that. This is quite messy, so if we print it instead it looks a bit cleaner. Okay, so you can see I have these two indices. You won't have either of those, so don't worry about that.

Now what we are going to do is create a new index, which will be called aurelius, and that is where we will put our documents. To actually implement that we will be going through the Haystack library, which you can install with `pip install farm-haystack`. What we want to do is `from haystack.document_store.elasticsearch import ElasticsearchDocumentStore`. So this is our document store, and of course this is not yet aware of our Elasticsearch instance; we need to initialize it. We'll store it in a variable called `doc_store`, and all we write is `ElasticsearchDocumentStore`. Now we need to initialize it with the parameters so it knows where to connect to our Elasticsearch instance. To do that we write `host`, and this is localhost. If you have a username and password set, which you don't by default, you will need to enter them in here. I don't have any set, so no worries. Then we also need to specify our index. At the moment we don't have an aurelius index, and that's fine because this will initialize it for us, so we'll just call it aurelius. If we go down here we can see what it actually did: it sent a PUT request to `localhost:9200/aurelius`. So that's how you create a new index.

After that, what we want to do is first import our data. We have the data here, which I got from this website and processed with this script, which you can find on GitHub. I'll keep a link in the description so you can just go and copy that if you need to. I haven't really done much preprocessing; it's pretty straightforward. All you need to do here is actually open that data, so we do that with `open`. From here, that data file is located two folders up in a data folder, and it's called meditations.txt. We're going to be reading that, and all we do is `data = f.read()`. If we have a quick look at the first 100 characters there, we see that we have this newline character, and that signifies a new paragraph from the text. So what we want to do here is split the data by newline, and then if we check the length of that you see that we have 508 separate paragraphs in there.

What we now want to do is modify this data so that it's in the correct format for Haystack and Elasticsearch. That format looks like this: it expects a list of dictionaries, where each dictionary has `text`, and inside here we would have our paragraph, so each one of these items here. Then there's another optional field called `meta`, and `meta` contains a dictionary, and in here we can put whatever we want. For us, I don't think at the moment there's really that much to put in here other than where it came from, so the book, or maybe `source` is a better word to use here, and all of these are coming from Meditations. Later on we will probably add a few other books as well, and then the source will be different, and when we return that item from our retriever and our reader we will at least be able to see which book it came from. It would also be pretty cool to include something like a page number, but at the moment there are no page numbers included with this, so we're not doing that. So that's the format that we need, and it's going to be a list of these.

To do that we'll just do some list comprehension. We're going to write this, and let's just copy this; I think it should be fine. We'll copy this and just indent that, and in here we have our paragraph, and the source is meditations for all of them, and then we just write `for paragraph in data`. Okay, so that should work, and if we just check what we have here, that's what we want. We have `text`, we have a paragraph, and then in here we have this `meta` with a `source`, which is always meditations at the moment. That looks pretty good, and we'll just double check the length again; it should be 508. Okay, perfect.

Now what we need to do is index all of these documents into our Elasticsearch instance, and to do that it's super easy. All we do is call `doc_store`, because we're doing this through Haystack now, and we do `write_documents` and just pass in our `data_json`, and that should work. Okay, cool. We can see here that what it's done is sent a POST request to the `_bulk` API, and it sent two of them, I assume because it can only send so many documents at once. So that's pretty cool.

Now what I want to check is that we actually have 508 documents in our Elasticsearch instance. To do that we're going to revert back to `requests`, so we'll do `requests.get` again, go to our localhost on 9200, and here we need to specify the index that we want to count the number of entries in, and then all we do is add `_count` onto the end there. This will return a JSON object, so we do this so that we can see it, and sure enough we have 508 items in that document store.

So if we head on back to our original plan, up here we had Meditations, and we've now got that. We've also set up the first part of our stack over here, so Elasticsearch now has Meditations in there, and we can cross that off. The next step is setting up our retriever, which we'll cover in the next video. So that's everything for this video. I hope you enjoyed it, and I will see you again in the next one.
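The three REST checks made in the video (cluster health, the index listing, and the document count) could look roughly like the sketch below. The helper names (`health_url`, `indices_url`, `count_url`, `get_json`) are made up for illustration and are not from the video; the base URL assumes the default port 9200 mentioned during installation.

```python
# Hypothetical helpers; only the Elasticsearch endpoints themselves come from the video.
BASE_URL = "http://localhost:9200"  # default HTTP port; change it if yours differs


def health_url(base_url=BASE_URL):
    """URL of the cluster health endpoint."""
    return f"{base_url}/_cluster/health"


def indices_url(base_url=BASE_URL):
    """URL of the cat API listing of indices (plain-text table, not JSON)."""
    return f"{base_url}/_cat/indices"


def count_url(index, base_url=BASE_URL):
    """URL that returns the number of documents stored in `index`."""
    return f"{base_url}/{index}/_count"


def get_json(url):
    """Send a GET request and decode the JSON body."""
    # Imported here so the URL helpers above work even without requests installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()
```

With a local node running, `get_json(health_url())` should include the cluster name and a `status` field (yellow on a single node, as described above), and `get_json(count_url("aurelius"))["count"]` should match the number of indexed paragraphs. The cat listing is plain text, so it would be read with `requests.get(indices_url()).text` and printed rather than decoded as JSON.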

Original Description

▶️ Stoic Q&A App Playlist: https://www.youtube.com/playlist?list=PLIUOU7oqGTLixb-CatMxNCO-mJioMmZEB
The second video in 'Building a Stoic Q&A App'. Here we're setting up Elasticsearch and Haystack to store the data (Meditations), ready for retrieval when we ask our app questions.
Find the code here: https://github.com/jamescalam/aurelius/tree/main/code/labs
🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5

Key Takeaways

Download and install Elasticsearch on Windows using MSI installer
Check Elasticsearch cluster health using Python requests
Create a new index called aurelius using Haystack library
Import Q&A data from meditations.txt file
Modify data to fit Haystack and Elasticsearch format
Index documents into Elasticsearch instance
Call `doc_store.write_documents` to index documents
Pass in the `data_json` list of dictionaries
Revert back to `requests` to verify document count
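
The data-preparation and indexing steps listed above can be sketched roughly as follows. This is a minimal sketch, not the author's exact code: the function names are hypothetical, the blank-line filter is an addition the video does not make, and the import path is the one spoken in the video for the farm-haystack release of the time (later Haystack releases moved it and renamed the `text` field to `content`).

```python
def to_documents(raw_text, source="meditations"):
    """Turn newline-separated paragraphs into the list of dicts Haystack expects."""
    return [
        {"text": paragraph, "meta": {"source": source}}
        for paragraph in raw_text.split("\n")
        if paragraph.strip()  # assumption: drop empty lines (not done in the video)
    ]


def make_store(host="localhost", index="aurelius"):
    """Connect to Elasticsearch; the index is created if it does not exist yet."""
    # Import path as spoken in the video; requires `pip install farm-haystack`.
    from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

    # Empty username and password match an installation with no credentials set.
    return ElasticsearchDocumentStore(
        host=host, username="", password="", index=index
    )


def index_documents(store, documents):
    """Write the documents through the store and return how many were sent."""
    store.write_documents(documents)
    return len(documents)
```

With Elasticsearch running, usage would be along the lines of reading the text with `open("../../data/meditations.txt")` (the relative path is inferred from "two folders up in a data folder"), then calling `index_documents(make_store(), to_documents(data))`. Passing the store in as an argument keeps the formatting and indexing logic checkable without a live cluster.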

💡 The same steps apply to other text collections: format each passage as a dictionary with `text` and `meta`, write the list through a Haystack document store, and confirm the document count against Elasticsearch.
