Table Question-Answering with TAPAS in Python
Key Takeaways
The video demonstrates table question-answering with TAPAS in Python, using Pinecone as the vector database and an MPNet model fine-tuned for tables as the retriever. It covers initializing the Pinecone vector database, creating a new vector index, retrieving the most relevant table for a question, and using TAPAS to extract the answer, including aggregations such as summing or averaging over table values.
Full Transcript
Today we're going to be taking a look at table question answering, which is essentially as if you could ask an Excel sheet a question like "what is the GDP across both China and Indonesia?" and it would be able to look at the table, identify the two parts of the table that are relevant to that question, sum both of those together, and return that answer. But imagine we take that and apply it not to just one table in Excel but to millions or even billions of tables, and the system is actually capable of taking our question, retrieving the correct table to answer that question, and then repeating the process I just mentioned, where it extracts the specific parts of the table that are relevant to our query and even performs operations like summing over those values or averaging over those values. That is what we're going to learn about in this video.

So let me just describe the process that we're going to be taking in order to implement this. We're going to start with a vector database, and naturally we'll be using Pinecone for that. Then we add something called a retriever model. That retriever model will be an MPNet model. Typically, for natural language semantic search, MPNet is a really good option, but this MPNet model has been trained for reading tables. So this is our retriever model.

Now what's going to happen is we're going to ask a question like the one I mentioned before, so something like "what is the GDP of China and Indonesia?". We take that question, it goes into the retriever model, which is our MPNet table retriever model, and it encodes that text into a vector, and that vector represents the meaning behind the question. That MPNet-encoded vector goes into our vector database. Our vector database then returns relevant tables, which have also been encoded by that MPNet table model, and it returns them to our next model, which is going to be a table
reader model. Now this table reader model is also going to read our question from up here. So, as you can see, it sees both what we've returned from Pinecone over here and also our question. For the table reader we're going to be using a model called TAPAS. What TAPAS can do is what I mentioned before: we take a table, and it will identify the parts of that table that answer our particular question. It will also say whether we need to sum over those parts, whether we need to average, or whether we don't need to do anything, because the value itself is relevant and answers our question. So that's the system we're going to be building. Let's move on to the code and we'll start putting that all together.

Okay, so we're going to be running through this table question answering example from the Pinecone docs. You can find that at docs.pinecone.io/docs/table-qa, and what I'm going to be doing is just going through the Colab, so you can click on "Open Colab" and run through the exact same code that I'm going to be going through. That will open this Colab notebook here. This is another really cool idea and example notebook from Ashraq, so again, thank you for that.

Now the first thing we want to do is come up to Runtime, go to Change runtime type, and make sure this is on GPU. If it is not, switch it, as that will just make things a lot faster later on. There are a few prerequisites that we need to install. torch-scatter might take a little bit of time, so if it is taking some time to install everything, that is the reason why. I'm not going to rerun that because I have already done it.

Then what we need to do is load this dataset here. This is from the Hugging Face Datasets Hub, where Ashraq has uploaded it. It is a subset of the Open Table-and-Text Question Answering dataset, which is just a load of text and tables from Wikipedia. Now once that has downloaded we'll see that
we have a few features: the URL, so where it is from, the title, the headers, which are literally the headers of the table, and then the data within that table. So we can have a look at one of those. The bits that we are most interested in are here. So the headers: this is about American football, no, baseball I think, one of those things, I'm not sure. You have your headers here, level, team, league, manager, and then we have the data. So in level you would have AAA, AA, A, Rookie, and so on, and the other columns have bits of data in there as well.

Now what we can do is format all of those into pandas DataFrames, which just makes things a lot easier for us in both reading and later formatting. So let's go ahead and do that. This again might take a moment to run. Okay, 14 seconds. Then we can run this and have a look at what I just showed you, but in table format. Now we can see that's a lot easier to read, nice formatting, so great.

Now what I want to do is move on to the retriever. Remember, in that visual before we had the Pinecone vector database, which led into the MPNet table retriever model. We're going to go ahead and initialize all of that. For the retriever we're going to be using the deepset/all-mpnet-base-v2-table model, so we execute that. As I said, this model has been fine-tuned specifically on retrieving and embedding table-like data and matching it up to natural language queries. Once that has downloaded we'll see this kind of explainer or summary of the model. We have the MPNet transformer model, which has been fine-tuned specifically for this, we have the pooling method, and it is using mean pooling, you can see that there, and there's a normalization step after. Because it has that normalization, we can use cosine similarity, which we can use whether there is normalization or not, and we can also use dot product similarity, because we have that normalization component.

Now this retriever does expect tables to be in a particular format, so we need to convert our tables
to this. Let's have a look at what that format actually looks like. We are going to have something like this: looking again at that same table, at the top here we have the headers, and then we have a newline character. Then we have the next row of the table, all of these separated by commas as you can see, and then we have a newline again. So essentially we're just reformatting it into a comma-separated file.

Now the next thing we want to do is initialize our Pinecone vector database. For that we need an API key, which is free, and we can get it from this link here if you're in the notebook, or if not just head on over to app.pinecone.io. That will lead to a sign-up or sign-in page, or it will lead to this if you've already signed up. What you need to do is head over to your default project, or any other project if you have other projects in there, go to API Keys, go to default here, copy this, and then paste it into here. I have pasted mine into a variable called api_key, so I can add that in there, run this, and that just initializes our connection to Pinecone.

From there, what we need to do is create a new vector index, where we're going to store all of these formatted table objects, but after they've been encoded by our retriever model. I'm going to call my index table-qa. I'm going to use cosine here, although like I said before you can also use dot product similarity. The dimensionality just aligns with the model, so we can actually see that if we do model.get_sentence_embedding_dimension(), or rather retriever.get_sentence_embedding_dimension(), and then we get this: 768.
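The comma-separated table format described above can be sketched in a few lines. This is a minimal illustration, not the notebook's own code: the notebook works on pandas DataFrames, while this version uses plain lists and the standard-library csv module so that it runs anywhere. The function name table_to_text is hypothetical, and the row values are made-up placeholders (only the header names come from the video).

```python
import csv
import io


def table_to_text(headers, rows):
    """Serialize a table as comma-separated text: headers first, one row per line."""
    buffer = io.StringIO()
    # lineterminator="\n" gives a single newline after each row,
    # matching the "headers, newline, row, newline" layout from the video.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


# Header names are from the video; the row values are placeholders.
example = table_to_text(
    ["Level", "Team", "League", "Manager"],
    [
        ["AAA", "Team A", "League A", "Manager A"],
        ["AA", "Team B", "League B", "Manager B"],
    ],
)
print(example)
```

Using a CSV writer rather than a plain ",".join means cells that themselves contain commas are quoted, so the row structure stays unambiguous. Each string produced this way is what would be passed to the retriever's encode step.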
So we could also put that in here if we wanted, so rather than hardcoding it you can just do this. We run that. For me, I've already created this index, so that will happen very quickly. If you haven't created the index, that will probably take 10 to 15 seconds to run.

Then what we want to do is go through our entire dataset in batches of 64. We get our processed tables, we encode them using our retriever model, and the output we need to convert into a list for Pinecone. We create a set of unique IDs. This is just a count. If you prefer you can use something else, but this works for this example, so we'll leave it at that. We add all of those into what we call an upsert list, so we're just going to pass the IDs and embeddings. We're going to store the tables locally, but you could also store the plain-text version of the tables in a metadata dictionary and upload those. We're not going to do that, we're just going to use the local ones for the sake of simplicity. Then what we want to do is just upsert all of these into Pinecone. We would run that, and it would take a little bit of time, I don't think too long, maybe six or seven minutes on Colab, but I have already run it. Okay, I can see it's working again here, so now I'm just going to stop it, because all of these have already been uploaded into my vector index.

Now what we're going to do is begin asking questions. This is not the full pipeline. Right now we just have the vector database and the retriever. We don't have the table reader yet, and we're going to implement that in a moment, but for now I just want to see: is it going to return the correct table for us? So we're going to say "what was the GDP of China in 2020?". We encode that using the retriever to create our query vector, then we pass that to Pinecone, and we're just going to return the top table. We could return
several tables if we wanted, we could return 100 if we wanted, but we're just going to be applying a reader to a single table, so I'm going to go with that for now. Okay, and you see that we get this ID here. Now this ID is, like I said, a count that we created earlier, so we can actually use that value to extract the table from the tables variable that we created earlier. And you can see that it does seem to give us a pretty relevant table. Right at the top here we have China, and we have the GDP there in millions of USD, and the year as well, which is 2020. So that looks pretty accurate.

Okay, so we've retrieved the correct table. Now what we need to do is extract that specific piece of information using our table reader model. For the table reader model we're going to be using a TAPAS model that has been fine-tuned for this specific task. We are going to use the model name google/tapas-base-finetuned-wtq, and we're going to be using the Hugging Face Transformers library. We need to initialize a TapasTokenizer, which is going to convert our natural language query and the tables themselves into tokens, or token IDs. They get passed into this TapasForQuestionAnswering, which is a TAPAS transformer model followed by a question answering head. It will basically go through all of those and identify the specific part of the table that answers our question, and it can also do things like say whether we need to sum certain values within that table, or average them, or do all these different operations, which is pretty impressive, in my opinion at least. We're going to package all of that up into this pipeline here, which is a table question answering pipeline, and that will just include our model and the tokenizer. We run that, and then what we can do is pass the table that we retrieved, so the China GDP table, and we also pass our query, which is "what is the GDP of China in 2020?".
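The reader setup described above might look roughly like the sketch below. It assumes the Hugging Face transformers package (with PyTorch and pandas) is installed, and it downloads the google/tapas-base-finetuned-wtq checkpoint on first use. The helper names to_tapas_table and build_table_qa_pipeline are hypothetical and not taken from the notebook. The transformers import sits inside the function so that the table-preparation helper can be tried without the library.

```python
def to_tapas_table(headers, rows):
    """Build a column-oriented dict with every cell converted to a string.

    The TAPAS tokenizer expects all cell values to be text, so numbers
    are converted with str() before the table is handed to the pipeline.
    """
    return {
        header: [str(row[i]) for row in rows]
        for i, header in enumerate(headers)
    }


def build_table_qa_pipeline(model_name="google/tapas-base-finetuned-wtq"):
    """Load the TAPAS tokenizer and model and wrap them in a table-QA pipeline."""
    # Imported lazily: this is a heavy dependency that also triggers a model download.
    from transformers import TapasForQuestionAnswering, TapasTokenizer, pipeline

    tokenizer = TapasTokenizer.from_pretrained(model_name)
    model = TapasForQuestionAnswering.from_pretrained(model_name)
    return pipeline("table-question-answering", model=model, tokenizer=tokenizer)


# Usage (not executed here because it downloads the model):
#   pipe = build_table_qa_pipeline()
#   table = to_tapas_table(headers, rows)
#   result = pipe(table=table, query="what is the GDP of China in 2020?")
#   # result is a dict with keys such as "answer", "cells",
#   # "coordinates" and, for this checkpoint, "aggregator".
```

Passing a dict works because the pipeline converts it to a DataFrame internally. If the retrieved table is already a pandas DataFrame, calling .astype(str) on it serves the same purpose as to_tapas_table.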
Okay, run that, and you can see [inaudible] it says to take the average over one cell. Okay, so it is correct. We should just take the average over this one cell, because that is our answer: the 27.8 million million, I think it is, in USD. If we come up here we can see it right there. Okay, so that is our answer.

Now what I want to do is ask more questions, but I want to do it a lot more efficiently than writing all that code out again. So I'm just going to create a few functions here that will help us. We have query_pinecone, which is going to retrieve the relevant information, match it up to a particular table, and return that to us. Then we have this, get_answer_from_table, which is just going to feed everything into our pipeline and return the answers.

Okay, so for this first question I'm asking "which car manufacturers produce cars with a top speed of above 180 kilometers per hour?". You can see that this is again a super relevant table, and this is already, at least for me, impressive in itself, that it's managing to get this. We can see max speed: 220, 190, 185, 186, so there are four manufacturers there that do that, which are Fiat, Bugatti, Benz and Miller. We come down here and do get_answer_from_table, and we get this: Fiat, Bugatti, Benz and Miller is our answer. There's no aggregator, this is just text, so it's saying you don't need to average or do anything here, these are just the answers.

Let's do another one: "which scientist is known for improving the steam engine?". We can see in this table, if we have a look here, "improving the steam engine", so we should expect the answer to be George Henry Corliss. Let's get the answer from the table: George Henry Corliss. Pretty cool.

Let's do another one, another kind of simple query, and then we'll move on to more advanced queries: "what is the Maldivian island name for OBLU SELECT at Sangeli resort?". Okay, we can see OBLU SELECT at Sangeli, and we have Akirifushi. It's probably a terrible pronunciation, I'm very sorry to anybody watching this. And yeah, we get the right answer, of course.

So that in itself is already really impressive, but it actually doesn't stop there, it gets even more insane than this. We can start asking more complex questions that take more than one step for this model to figure out. So I want to say "what was the total GDP of China and Indonesia in 2020?". Okay, let's query. We should get the same table that we got before. Yes, we do. Then we want to get the answer from this table, and we get this: we get the aggregator SUM, so it says to sum these two values here, the 27.8 million and 3.8 million, and you can see here that, I believe, this is correct. So we could maybe add a little wrapper function that consumes the different types of aggregators, like sum or average, and just handles that little bit of logic at the end there, and we have our answer, which is insane. So that's another thing that's really, really impressive.

And it's not just sum. We actually kind of saw this earlier, although it wasn't used in the right way, but let's have a look at this: "what is the average carbon emission of power stations in Australia, Canada and Germany?". Okay, let's take a look. Okay, looks pretty accurate, although this is just a random selection of different power stations in these different countries, so it's not perfect, but nonetheless we can go with this. Then we can see we have the aggregator AVERAGE, and we need to average over these values here. Number one is not very good. Who is that? Australia. Yeah, very bad.

But that is really pretty impressive, at least to me. I was pretty blown away by this example. So that's it for this video. I hope that this has been interesting and useful. It definitely is for me. I'm really enjoying seeing how we can actually apply question answering to tables, and even more so with the little aggregations at the end. It's a very
little feature, but I think it makes a pretty big difference. So thank you very much for watching the video, I hope it has been useful, and I will see you again in the next one. Bye. [Music]
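The transcript mentions adding a small wrapper that consumes the aggregator returned by TAPAS and applies it to the selected cells. The video does not show such a function, so the following is only one possible sketch. It assumes the pipeline result is a dict with a "cells" list of strings and an optional "aggregator" label, and that the labels are NONE, SUM, AVERAGE and COUNT, as for WTQ-style TAPAS checkpoints. The names resolve_answer and parse_number are hypothetical, and the numbers in the usage example are made up.

```python
def parse_number(cell):
    """Convert a table cell such as '1,250.5' to a float (thousands separators removed)."""
    return float(cell.replace(",", "").strip())


def resolve_answer(result):
    """Apply the aggregator predicted by the table reader to the selected cells.

    `result` is assumed to be a dict with a "cells" list of strings and
    an optional "aggregator" label.
    """
    aggregator = result.get("aggregator", "NONE")
    cells = result["cells"]

    if aggregator == "NONE":
        # No aggregation: the selected cells are the answer themselves.
        return ", ".join(cells)
    if aggregator == "COUNT":
        return len(cells)

    values = [parse_number(cell) for cell in cells]
    if aggregator == "SUM":
        return sum(values)
    if aggregator == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"Unsupported aggregator: {aggregator}")


# Made-up cells, purely to show the behaviour of each branch.
print(resolve_answer({"aggregator": "SUM", "cells": ["1,000", "250.5"]}))
print(resolve_answer({"aggregator": "AVERAGE", "cells": ["10", "20", "60"]}))
print(resolve_answer({"aggregator": "NONE", "cells": ["Fiat", "Bugatti"]}))
```

Real tables often contain cells with units or footnote markers, in which case parse_number would raise a ValueError. A production version would need more forgiving parsing, or would fall back to returning the raw cells.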
Original Description
Table question-answering (QA) is like asking Excel a natural language question and getting a truly intelligent, human-like response. We can ask something like "what is the total GDP across both China and Indonesia?" and Google's TAPAS (the machine learning model) will look at the table, find the two parts of the table needed to answer the question, sum both and return them.
We learn how to apply TAPAS for table question answering using Hugging Face transformers and Python.
We take this further by using a Pinecone vector database with a Microsoft MPNet Table question-answering (QA) model. With this, we can ask the question, search through a million, 10 million, or even a billion tables - retrieve the most relevant tables - and then answer the specific question again with Google's TAPAS.
🌲 Pinecone example:
https://github.com/pinecone-io/examples/blob/master/learn/search/question-answering/table-qa.ipynb
00:00 Intro
01:04 Table QA process
03:38 Getting the code
04:08 Colab GPU and prerequisites
04:33 Dataset download and preprocessing
06:10 Table QA retrieval pipeline
11:29 First test, can it retrieve tables?
12:55 TAPAS model for table QA
15:04 Asking more table QA questions
17:37 Asking advanced aggregation questions to TAPAS
19:38 Final thoughts