How to Build TensorFlow Pipelines with tf.data.Dataset
Key Takeaways
The video demonstrates how to build efficient input pipelines for machine learning using TensorFlow's tf.data.Dataset, covering topics such as data loading, batching, shuffling, and transformation. It highlights the benefits of using tf.data.Dataset, including improved performance and reduced memory usage.
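As a rough illustration of the loading, shuffling, and batching steps summarized above, here is a minimal sketch. The toy lists and the batch size of 2 are assumptions chosen for readability; the shuffle buffer of 10,000 is the value used in the video.

```python
import tensorflow as tf

# Hypothetical toy data: four inputs and four matching labels.
inputs = [0, 1, 2, 3]
labels = [1, 0, 1, 0]

# from_tensor_slices takes a single argument, so inputs and labels
# are passed together as one tuple.
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

# Shuffle with a buffer larger than the data, then group into batches of 2.
batched = dataset.shuffle(buffer_size=10000).batch(2)

# Each element is now an (inputs, labels) pair of batched tensors.
for x, y in batched:
    print(x.numpy(), y.numpy())

# Counting by materializing the dataset is fine for toy data only.
num_batches = len(list(batched))
```

The printed order changes between runs because of the shuffle, but each input stays paired with its own label.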
Full Transcript
All right, we're going to go through TensorFlow datasets. Essentially, these are a more efficient, built-in way to build our input pipelines. You can see the documentation here if you'd like to go through it; I'll leave a link in the description. But we're just going to dive right into it.

To use the dataset object we need to import TensorFlow, of course, as tf, and we're also going to be using pandas and NumPy for a few examples, so we'll import those as well.

There are a few different ways that we can read data into our datasets. The first of those is from in-memory data, which is probably the way most of you will have seen it, if you have seen this before. We'll put that together quickly. The first option is a list: we can take a couple of Python lists and build our dataset from them, so we'll put some together really quickly here. So we have inputs and outputs, both lists. To create our dataset object from these two, we just type tf.data.Dataset, with a capital D, dot from_tensor_slices, like this. One thing we're doing here is putting both of these into a tuple, because this only accepts one input parameter, and that input parameter is basically all of the data that we are going to be feeding into our model later on. The default format for the dataset object when it's feeding into a model is simply one input tensor and one output tensor, or target or label tensor, whatever you'd like to call it.

Once it loads, we will have built our first dataset. That took a lot longer than it should have. Then we do for item in dataset, just to see what it looks like. It's like a list of tensors, and we can see the tuple format that we created here. This is the first item we have, and the first tensor object holds a NumPy integer, which is zero,
which matches up to this, and next to that we have the output value, which is the one here. It's the same for the following three rows.

We can also do the same with NumPy arrays. It's pretty much exactly the same format, like this, and it will produce the exact same thing.

Then there's the case where we want to use a DataFrame, which I assume a lot of you will. Before, we were passing inputs and outputs; this time we just pass the DataFrame. We'll see that this creates a slightly different format, and we would restructure the dataset before feeding it into our model so that it reads everything correctly. For now we're just going to leave it like that, and we'll go over mapping pretty soon.

The other option we have for reading data into our dataset is reading it directly from file. The benefit we get from doing that is that we are reading data from an out-of-memory source, and because of that TensorFlow will read the data batch by batch rather than pulling in the entire dataset all at once. If we have a big dataset this is pretty useful, because in a lot of cases we can't actually bring everything into memory at once. This allows us to get around that, it does it in an efficient way, and it's super easy as well.

I've also put this together (let me zoom out a little bit) as my attempt at demonstrating what the from-file version does. This is our full dataset here, and we've split it into three batches; obviously you'd have way more than this. At any one time we feed in a single batch, apply our dataset transformations, and feed it into the model for training. Once we're done with that, we go on to the next batch and
then feed it into the data transformations and train the model with it.

So let's do that quickly. I've got this train.tsv here, and that file is from the Sentiment Analysis on Movie Reviews data on Kaggle. You can download it here; I'll put the link in the description. We're going to read from it directly. It's slightly different: we do tf.data.experimental and then make_csv_dataset. We are actually using tab-separated values here rather than comma-separated values, so all we're going to do is change the field delimiter to a tab character instead.

So we pass train.tsv, and in here we also define our batch size. We're going to do something really small for now. When you're using this for your actual models you would probably use a batch size of 64 or 128 or whatever it is you're using, but we're going to go for eight so we can easily visualize everything. Next we set the field delimiter, which is where we tell it that this is actually a tab-delimited file rather than a comma-delimited one. We also need to set the label for our dataset, which, if we look here, is this Sentiment field. Another really useful argument here is select_columns. With this we pass a list of the columns that we want to keep, and it will drop all the other columns. It depends on what you're doing, obviously, but here we're going to keep the ID and also the input and target data: the ID is PhraseId, the input data is Phrase, and then Sentiment for the label.

Let's execute that and have a quick look at what we have. We use this take method to just take the first batch within our
dataset. If we said take(20), it would take the first 20 batches and nothing else. We just want to see the first one, so you can see the actual format within the dataset.

You can see here we have the PhraseId, and this is why I wanted to keep the PhraseId in. We don't actually need it for training the model, but I want to show you that it actually shuffles the data. In the file, PhraseId is one, two, three, four, five, in order, but when we read data in with this it automatically shuffles everything, which is a pretty cool and useful feature. Then here we have the phrases, which would be our input data, and here we have the sentiment ratings, which would be our target data.

That's everything for reading into our dataset, so we'll move on to performing a few operations on it. Let's use this same file, but load it into memory first, like this. I'm doing this because I want to show you the shuffle and batch methods, and obviously if the data is already shuffled and batched there's no point showing you. This is useful to know if you're reading things from memory; if you are reading things from disk, there's no point in doing this part. It's a TSV, so we need to separate these with a tab. Let me just make sure it's read in correctly. Okay, cool.

We need to read it into our dataset, so we're going to use the same as before here, and then do for item in dataset.take(1), print item. Okay, so this is because we have these phrases in here. We don't really need them, so let's just use SentenceId to make things a bit easier for now. If you're using strings for machine learning, you're obviously going to tokenize them, so you
would do that first, but we're not going to go all the way through to actually training the model. We're just going to have a look at the pipelining.

Okay, cool. So our first row is one, one, one, which is what we expect. The first thing is to do the shuffling and the batching, like I said before. We'll stick with a batch of eight, since it's a bit more readable. It's super easy: we do dataset.shuffle and add in a large number here to make sure it shuffles everything as far away from its neighboring samples as possible. The sort of standard number here is 10,000. I don't know exactly why, but almost every time I've seen shuffling, I've seen people use 10k, so I'm going to stick with it. Then, although I'm so used to putting 64 or 128, we'll just put a batch of eight here.

Now that we've batched it, if we take one we should actually see more than one record, because take returns the highest-level item within the dataset, which is now a batch. So we should see quite a few, and we can see it's definitely mixed up the PhraseIds, because these were one, two, three, four before and now they're all mixed up. So that's cool: we shuffled and batched incredibly easily. That's one of the benefits of doing it this way. Writing this code is incredibly easy and simple to remember, and it's very obvious when you're reading it what is happening: your dataset, shuffle, batch into eight. Maybe some people might get a bit confused by this number here, but otherwise it's super easy to read, and it's really quick and efficient, so it's pretty good.

The next thing I want to show you is the map method. For any more complex data transformations this is probably what you'd use, and it's really useful. So what can we do? We can, say, multiply everything in the labels
by two. Obviously we wouldn't do this in reality, but it's just an example. We'll also reformat the data, so we're going to build it as if we have two input fields. For example, when you're working with transformers, a lot of the time you have an input IDs field or layer and you also have an attention mask field or layer. Don't worry if that doesn't really make sense; essentially we just have two input layers or fields. So we'll format this in the correct way to have two inputs and one output, and we'll also change the value of the output, just so we can see how it works.

Generally the best way of doing this is to create a function, so I'm going to call it map_func and pass x. This is going to be passed every single record within our dataset. One thing I just realized is that we should batch afterwards, because otherwise we have to consider the batching inside the function, so let's move the shuffle and batch to after the map and write this.

When we are working with multiple inputs, or even outputs, the best way to let TensorFlow know where each input is supposed to go is to give the input or output layers a name when you're defining and building the model, and match that to the names that we give in this dictionary here. With the transformer example, I think most people use input_ids. I'm just going to make this up: our input ID is going to be this value here, and we're going to put it into a list, because typically you'd have an array or a list of numbers coming in here. We're just going to take the first value of x. Then the mask: we're going to write it like this and put one. Then, outside of this dictionary, we only have one label or output or target, whatever you want to call it, and we're just going to perform a really basic operation on it,
just to show that we can: multiply by two, nothing special.

To apply this mapping function, again it's incredibly easy: it's dataset equals dataset.map, and we pass map_func like this. We did batch before, so let's rerun this bit of the code so that we have it all unbatched again, then run this and this.

Okay, let's have a look at what we have now. It's kind of hard to read, but we're inside a tuple: this is index one of the tuple and this is index zero. Inside index zero we have the dictionary that we defined, which has input_ids and also has the mask. This tuple format, with the input and the output, is what TensorFlow will be expecting when we fit the model. If it sees a dictionary in either the input or the output, it will read the keys of the dictionary, and the values that match those keys will be passed to the layer with the corresponding name. So essentially you would have to have a layer called input_ids, and it would pass this to that layer; it would also pass this to a layer called mask; and the outputs would be passed to our output layer. We wouldn't necessarily need to name that one, though.

Then we want to batch it like we did before, and we can view what we have here. So we have the dictionary and then we have everything else as well. That's pretty good; that's what we want.

Now we need to really quickly define a model. It's going to have an input_ids input layer. I'm not actually going to define all of this; I'm just going to show you how it would work. You define your shape here, and then you would also have your name, and it's this name that would
have to match up to the dictionary that we fed in previously. So it's input_ids, and they would have to match. Then we would have a mask as well, because we have two inputs, remember, and this one would be called mask. Obviously you could call these anything else you want, like input one and input two; it doesn't matter.

Later on, when we actually fit the model, we do it like this. Obviously we would have an output here as well. I'm not defining the rest of it, but we'd have the two inputs, something in the middle, and the output. Then all we'd have to do with that model architecture is fit the dataset, like this, with however many epochs you're training for. That's everything for that.

There's one other thing that I wanted to mention as well. A lot of the time you're going to want to split your data into a training and a validation set, and doing that is super easy. Earlier we mentioned the take method, so we can use dataset.take and we can also use dataset.skip. These are like equal and opposite: if we do dataset.take(10), it will take the first 10 batches of the dataset and nothing else; if we do dataset.skip(10), it will skip the first 10 batches of the dataset and keep everything after them.

This is not the most efficient way of doing it, but let's do it like this for now: we get the length of the dataset. I say this isn't efficient because, to take a list of the dataset, we're generating everything (the dataset is a generator), putting it all into memory as a list, and then taking the length of it. So it's better just to know how many batches you're building from the start if it's
a big dataset. For this it's fine because we don't have a lot of data, but normally it would be better not to do that. You can see that even with this dataset, which is pretty small, it's still taking quite a long time.

Say we want a 70/30 split: 70% for the training data and 30% for validation, and probably test later as well, but you'd just split that afterwards. The training size would be 0.7 of the length, and remember this is taking batches, so it would actually be the length divided by the batch size, which is eight. Then we have to round this to the nearest integer, because we can't take 10.2 batches or something like that. So we round it here, and let's see what we have for train size. Okay, so we have 1,707 batches for the training data. We want to take that number of batches, so we create another dataset, the train dataset, with take and the train size, and for the validation dataset we just skip those 1,707 batches, like that. Super simple.

Let's take the length of those so we can check. Again, I know it's not efficient, but it's the easiest way to do it quickly. So we get 1,707, and then our 30% value is, when it finally loads... Ah, okay. We already have the batch size applied here, so we've already accounted for the batches in the dataset and didn't need to divide by the batch size again. That's why it's taking so long. So it should actually be that way, without the division. Let's do that. So now we have a training size of 13.6k, and the remaining batches will feed into our validation data, which will be just under 6,000 values.

But that's everything that I wanted to go through. We've covered all of the essentials of the TensorFlow dataset object: how we can load in-memory data or read into datasets
from file, how to batch and shuffle the ones that we read from in-memory sources, how to transform the datasets with map, and how we can feed them into models. One thing to know is that if you just have an input and an output, like you probably will for most models, you don't need to do anything: you just have it in the tuple format with inputs and outputs, and then you would just do model.fit(dataset), like that. You don't have to name layers or anything; you just feed it straight in. After that we went through the splits, just here. So that's everything. I hope it's been a useful video and I hope you enjoyed it. Thanks for watching.
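The map, batch, split, and fit steps described in the transcript can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the video: the 80 synthetic rows stand in for the loaded file, the tiny model is hypothetical, and three details are swapped in here rather than taken from the video. Those are cardinality() in place of len(list(dataset)), and the seed and reshuffle_each_iteration=False arguments, which keep the shuffle order fixed so that the take and skip splits do not overlap.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the loaded file: 80 rows of [id, sentence_id, label].
rows = np.array([[i, i // 4, i % 5] for i in range(80)], dtype=np.float32)
dataset = tf.data.Dataset.from_tensor_slices(rows)


def map_func(x):
    # Two named inputs in a dictionary, plus one target multiplied by two.
    inputs = {
        "input_ids": tf.expand_dims(x[0], axis=0),
        "mask": tf.ones([1], dtype=tf.float32),
    }
    return inputs, tf.expand_dims(x[2] * 2, axis=0)


# Map first, then shuffle and batch, so map_func sees single records.
# Assumption (not in the video): a fixed, non-reshuffling order so that
# the train and validation splits stay disjoint.
dataset = (
    dataset.map(map_func)
    .shuffle(10000, seed=42, reshuffle_each_iteration=False)
    .batch(8)
)

# The dataset is already batched, so its cardinality is the batch count
# and there is no need to divide by the batch size again.
num_batches = int(dataset.cardinality().numpy())
train_size = round(num_batches * 0.7)

train_ds = dataset.take(train_size)  # first 70% of the batches
val_ds = dataset.skip(train_size)    # everything after them

# Hypothetical two-input model; the dictionary keys and layer names
# match the keys returned by map_func.
ids_in = tf.keras.Input(shape=(1,), name="input_ids")
mask_in = tf.keras.Input(shape=(1,), name="mask")
hidden = tf.keras.layers.Concatenate()([ids_in, mask_in])
outputs = tf.keras.layers.Dense(1, name="outputs")(hidden)
model = tf.keras.Model(
    inputs={"input_ids": ids_in, "mask": mask_in}, outputs=outputs
)
model.compile(optimizer="adam", loss="mse")

history = model.fit(train_ds, validation_data=val_ds, epochs=1, verbose=0)
```

Note that cardinality() only helps when the size is known. A dataset that repeats indefinitely, or one whose size TensorFlow cannot infer, reports a negative sentinel value instead, in which case the batch count has to come from somewhere else.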
Original Description
Link to updated version (without video freeze): https://youtu.be/f6XVfgJTbp4
An introduction to building better input pipelines for Machine Learning in TF2.
🤖 70% Discount on the NLP With Transformers in Python course:
https://bit.ly/3DFvvY5
Link to tf.data API docs: https://www.tensorflow.org/guide/data