Data Preparation Toolkit for LLM Application Developers | Large Language Models | Community Webinar

Data Science Dojo · Advanced ·🧠 Large Language Models ·1y ago

Skills: LLM Foundations 90% · Fine-tuning LLMs 80% · Prompt Craft 70% · RAG Basics 60%

Key Takeaways

The video discusses the Data Preparation Toolkit for LLM application developers, highlighting its importance in crafting high-performing Large Language Models, and demonstrates its usage with various tools and techniques, including data ingestion, quality assessment, and filtering.

Full Transcript

I see we have people joining in; let's give them a minute before we go live. We are now live. Hello everyone, thank you for joining us today. My name is one and I'm here from Data Science Dojo. Before I introduce our speaker for today, I'd just like to quickly share something that we have been actively working on. If you're able to see my screen right now, we are currently working on our exclusive Large Language Models Bootcamp. It is an exclusive bootcamp and this is the final one for this year, so don't forget to talk to an advisor or attend our info session. We are engaged with industry leaders from top AI companies, so don't forget to check out our comprehensive curriculum; you can find everything that is going on in the world of LLMs and AI. And as you can see, there is this banner: you can definitely get good pricing for it if you book a talk-to-an-advisor call with us.

Before we go on to the presentation, I'd like to introduce our speaker. For today's session we have Shahrokh Daijavad, a distinguished research scientist in the Watson data engineering group at IBM and the IBM Almaden Research Center. He has a very rich background in edge computing and data engineering. He earned his bachelor's in engineering and his PhD in electrical engineering from McMaster University, and spent years at the IBM T.J. Watson Research Center. His recent research focuses on AI at the edge and data engineering for IBM Watson AI offerings. Thank you so much, Shahrokh, for taking the time to join us at Data Science Dojo; really excited for this talk. Without any further ado, over to you now.

Thank you, thank you very much for that introduction, as one. So I'll go ahead and start sharing my screen. Okay, so you should be able to see my screen, and I will be putting it in slideshow mode. Everybody okay? You can see this? Nice. Okay, so the topic
of our discussion today is an open-source package that we call Data Preparation Kit, or in short DPK or Data Prep Kit, which we are open sourcing and bringing to the community for LLM application developers. I will make clear what I mean by LLM application developers: we are talking about a much larger community that is doing things other than just creating models. As we all know, creating models is so expensive computationally and requires so many resources that there are not that many companies that can do it. On the other hand, there is a much larger number of enterprises or companies that want to be application developers, who downstream do fine-tuning, RAG, instruction tuning and so on. We think that what we've brought to this open-source community helps not only the model developers and the big companies, because we use it for developing our own models in IBM, but also all the downstream application developers.

So we will talk about the motivation: why did we do this, what is it, how does it work. Then hopefully we'll do a couple of demos, and we will summarize what we have.

Okay, so let's start with the background. Every conversation in AI starts with models and the creation of models, but it ends with data. This is based on a Gartner report: 79% of people who are in this business and are considering LLM applications think that data preparation and generation is the most common strategic task for AI teams, and that data volume and complexity is one of the most challenging problems here. On the right, we thought it's kind of cute to remember something
that one of the engineers at IBM, who worked on the IBM 305 RAMAC computer, said. This is the first commercially available computer that used a random-access disk drive, and it was done in 1956. At that time he knew this; these are his words: garbage in, garbage out. So if you do not have good data, whether it's for creating models or for downstream applications, you are not going to get good results.

Okay, so the data is everywhere. When you want to create your own models, of course you go to a large number of data sources: you crawl the web and collect all the HTML pages, you collect from all the websites, all the PDF files of the world, all the Microsoft Office files of the world. So you are drowning in data as the source, not only for creating models; when it comes to things like fine-tuning, instruct tuning, RAG and so on, you also bring in your own data, proprietary data and so on. So the amount of data that needs to be processed and prepared to create good data that gives good results is enormous.

This particular slide is a life cycle for just the data that is used in the pre-training stage, and this is the journey that we took for creating the so-called IBM Granite model, whichever way you want to pronounce it. I'll come back to the downstream applications in the next slide. Just for the creation of the models, starting from the left, you start from external datasets, IBM internal datasets and partner datasets, and you download and extract all of these files. Then you go through the task of data cleaning, and this is for
first of all ingesting all of that data into the schema; you put it in some databases and so on. Then you do exact deduplication, because you don't want to carry duplicate documents all the way through. You do so-called fuzzy deduplication, which is based semantically on the meaning of the documents: if they are not exactly the same, they can still be duplicates, and it takes some of them out because semantically they are the same. Then you do language detection. If you're doing natural language processing, NLP-type stuff, the language here of course means whether it's English or French or Japanese and so on; in the context of code, it's whether the language is C or C++ or Java or whatever.

Then you do annotation of this data and then filtering of this data. Some of the modules used for this are HAP, which is hate, abuse and profanity; you want to take those out. Based on some well-known, published metrics, there are things that identify documents as better quality; you want to keep the better-quality documents and throw away the lower-quality ones. For language there is a well-defined set of metrics; for code there is another set of metrics for the quality of the code. The more filtering you do, and the more judicious filtering you do, the more you get rid of that so-called garbage-in stuff, so you can expect to get better results at the end.

Of course, after you've done all your filtering and annotation, you tokenize, because you now want to send the tokenized results to some LLM engine. And if you are creating models, you feed it into this so that, and this
is during the pre-training stage, you create your models based on that. So this is just the pre-training stage, for our creation of the models based on all of this data.

Now, if you go to the full life cycle, pre-training is only the top right of this picture. Beyond pre-training, where all that data was being used, you come to fine-tuning, instruct tuning and retrieval-augmented generation, RAG, where you may bring in more data, more labeled data, more appropriate or more customized data, to augment your models and get better results. The point is that all of those filters and annotators that were used in creating the model for pre-training, the dedup, quality, HAP, license filtering for code and so on, come into play here too. That was the whole motivation for us: when we used all of these modules in creating the IBM models, we realized that the same modules can be used by a much larger community of LLM application developers.

Okay, so what are the challenges in data preparation? They are not known upfront, and it's time-consuming and cumbersome to discover them. Every use case has its own unique needs. Then there is manual verification when the size of the data is large. Of course the size of the data is huge in creating models and pre-training, and it gets smaller on the other side of the spectrum, when fine-tuning and RAG come into play. But even in those stages the data is still large; we are not talking about a handful of data files. From millions and billions of files, we are now going to millions and thousands of files on the
other end of the spectrum.

So, modalities. Once I get to the details of what Data Prep Kit is: at the moment, what we have in the open source looks at two modalities, language and code. But of course we are adding modalities of audio, image and so on. Those are all in the works; we are working on them internally, and they will be added to this open-source repo.

Again, because you want to process this large amount of data, you need automation; you cannot go and run these transforms in a non-automated way. So you want to create a low-code or no-code automation mechanism for running over a large number of data files, and also for a sequence of these transformations. You want to have a pipeline that says: I do dedup first, then I do quality, then I filter out specific languages and so on. You want a dashboard or a UI that allows you to do all of this in an automated way. So these are the challenges.

Based on all of these challenges, let's now introduce what this Data Prep Kit is about. Data Prep Kit is an open-source toolkit, Apache 2.0 licensed, that has all of these recipes that we call transforms, for code and language modalities at the moment. At the top right I have the URL for the repo, so you're welcome to click on it. I'm sorry, you cannot click; I'm clicking. Just take it down, or if you get the presentation later on, you should be able to click on it and it will take you to the repo. At some point I'm going to go to the repo itself, because I want to show you some of the things in it. But right now let's look at the screenshot: when you go to GitHub, to
this repo, this is the first screen that you will see. The interesting thing is that this repo is being worked on every day; it's very much alive. It's been live since, I think, May or June of this year, and every day new transforms and new code are being added, some by our own team and some by open-source contributors. One of the pluses of this data preparation kit compared to others in the market is that we are making it easy to add your own transform. So it is a totally dynamic repo, with everyone contributing to it.

As I said, it has a growing set of modules, and you can bring in your own module very easily. We want to encourage research and development in this area, and make it easy for every one of those application developers who has a need for cleaning, filtering and annotating data to just go and use our existing modules if they fit their needs. And if you have something that you've done yourself that is very useful for transformation, cleaning and preparation of data, we hope that you consider adding it to this repo.

Again, as I said, this is open source. On the first page of the repo you also see this picture, and of course there is the QR code here; if you point your camera at it, it will also take you to the same repo.

Here we are showing some of the modules that exist in the repo at the moment. They are categorized into three major categories, one of them data ingestion. So data ingestion is
when you bring in either your HTML, your PDF or your code. Code, in our case, means that we've used lots of GitHub repos; that's the data that is brought in for code generation. Anything that has to do with code usually starts with lots of GitHub repos and the URLs for those repos. So this is the ingest part where you start. As we go from one transform to the other, throughout all of these transforms, we work with Parquet files. The Parquet file has a column called contents, which carries the content of whatever it is you are working on, and each transform either modifies the content or adds columns to the Parquet file for annotations and so on. We will look at some examples of this.

Some of the boxes in this picture, starting from the left, are in green. They were used for some specific applications: some were used for fine-tuning code, and some, for things like document chunking and document embeddings, were used for a specific RAG example. That's why they are highlighted. But of course there are other modules that you may want to use, depending on your application.

I referred to this a couple of times: these modules have been battle tested when we were creating the so-called IBM Granite models, which are part of the watsonx family of products by IBM. Some of those models are open sourced already, you will find them on Hugging Face under ibm-granite, and some of them are upcoming. The first one that was open sourced was
the code one, and the language one, I think, will be there any day; it may already be there. I'm not in the team that creates the models; I'm in the team that does the data preparation, the so-called data engineering team. But again, this is the reference to where those open-source models will be. The reason for bringing that up: imagine how many terabytes of data and how much computation were used for the creation of the models, and every one of these modules that you see up here was seriously battle tested in that. That's why we think this is a very useful set of tools for everybody, because it's been tested; we ate our own cooking.

So let's look at the modules and runtimes. This is important: for every single module that we have, there is a Python version of the module. That's sort of a given. I think with one of them we don't have this; if you look at the table on the right, this is the profiler, I don't think we have a simple Python one, and the fuzzy dedup one. But it usually starts with a Python module, and then you have a Ray-enabled, scaled version of the same module. And for a few of the modules we also have a Spark-enabled version, so that you can run on a Spark-enabled cluster and so on. We majored more on Ray than Spark. We are hoping that, based on the documents and instructions that we've provided in the repo, more people will contribute Spark versions of these, but we did just a few to show that our repo is not limited to Ray scaling. You will see one more column to the right, which says Kubeflow Pipeline. Of course, that
Kubeflow pipeline was created for the automation of running these modules and so on. Let me see how I'm doing with time. Okay, I'm doing okay with time.

Every single one of these modules is shown here, and this table is taken from the first page of the repo, so you will see it when you go to the repo. The table is growing: since the time I took this screenshot and put it into the presentation, if you go to the repo live now, you will see more modules. For instance the HAP module, the hate, abuse and profanity module that filters out that stuff, is there. So when I go to the repo to show you a couple of things, we will see that the table is a little larger than what you see on the right.

The other point I wanted to make about the table is that if you want to know how every single one of these modules works, what it is based on, what algorithm it's using and so on, you click on the module itself; these are all clickable in that table. It takes you to the README file, which shows you some information: what was the source, what did we use. We have used open source for many of these modules. Not all of them were created entirely by the IBM team; some of them are, by the way, and we are open sourcing even those that we created ourselves. And we are using some code from others, either as a reference or, in some cases, as the Python library that our module links to and uses.

More than that, there is a link at the bottom of this page: this is a technical overview of DPK, and it goes into some details. This is an arXiv paper that we just published on arXiv, I think
two weeks ago, maybe less than two weeks ago; I think Monday of last week. If you go to this paper on arXiv, you will see the full PDF of the paper. I think it's a 10-page paper that goes into some details.

Okay, let me see what I have. I have some stuff on how it works and some examples and so on, but I think maybe it's a good time to pause for a minute. I've been just talking non-stop, so let's see if there are any questions at this point. I'm at your service before continuing with this slide. Any questions so far? I'm here to answer. Okay, so, f and resan, if there are questions in the Q&A and so on, please let me know.

Okay, so if there are no questions, let's continue. As I said, every one of these transforms either changes the content of the files based on some criteria, so it reduces the number of files, the amount of data, based on duplication and so on, or it annotates, and says this is the quality and so on. You can use one or all of these transforms in some sequence that makes sense for your own data. It starts from raw data: at the moment, as I showed, PDF or HTML or zip files of GitHub repo code and so on. It starts from those and then goes through every one of these transforms. From that point on, once it ingests those input files and converts them to Parquet, all the other transforms take Parquet as input and have Parquet as output, all the way up to tokenization.

So let's look at a particular example. Documents, language and code, are in PyArrow tables. Let's look at something that starts from having a bunch of records with the URL for the
documents, some unique ID for the document, and then the content of that document, on the left. Then we annotate: we use the language ID module that we have to annotate with the spoken language, so we add a column which says whether this is English or Japanese, for instance: English, English, Japanese. Now you go through the module that looks at quality. If you want to know more about how the quality works and what metrics are used to give it a score between, let's say, zero and one, one being the highest quality and zero the lowest, the so-called perplexity score, you have to look at the paper that we have or the README file for this specific module. Based on the algorithms that we have, it gives a good score to the first document, 0.81, but it gives a low score to the second one, and let's say the Japanese one gets a good score too. Now you go to another module, which is called filter, and you say: of all these that I had on the previous page, filter down and give me only the ones that are in English and have a quality score greater than 0.80. So that's the file that you keep for the subsequent steps.

Okay, I see on the top of my screen that a lot of things are going on in chat. If you don't mind, I can come back to the questions at the end, unless somebody wants to do a quick one that is related to this specific page. Let me pause again for a second. f and resan, is there anything urgent that I should answer now, or should I go to the end? Okay, I guess everybody's kind of muted at this point, so I'm sorry about that. Okay, let me come
back; we will come back to hopefully every question that you have and everything that is going on in the chat.

Okay, so let me go to the next page. This is a very specific example. Remember I told you that we've already open sourced our Granite models for code. For code you will see some specific modules, like programming language detection; this doesn't apply to language, it applies to code only. These were used starting again from the GitHub data on the left. There are some specific things that look at, for instance, the license for the document, or at detecting malware in the code. Not every single one of these is in the open-source repo at the moment; I don't think the malware stuff is, for instance. But I'm showing this to give you a very specific example of how this was used when we created the Granite code models, the ones that are on Hugging Face and that you can use any time. So this was just putting this in a pipeline. Of course, many of the modules here are common between code and language, some are specific to code and some are specific to language. If you go back to the chart that we had before, remember that on the table it says universal, meaning that these modules, exact and fuzzy dedup, ID annotation, filter, all apply to both code and language. The modules below that are language only, and of course these are specific to code.

Okay, so from laptop to cluster. This is very important for us and for everyone who does data preparation: you are dealing with terabytes of data in
processing or preparing your data and so on. So you have to have a way of testing things small on your laptop, and you have to have a very easy way to scale this up. That's why we have those Ray- and Spark-enabled modules. You are able to run this even on your laptop, reasonably. You don't have to be limited to one; these days, laptops can do hundreds of files. There are full instructions that say you install a kind cluster on your laptop, so you run the Ray-enabled version of the code even on your laptop, using the kind cluster. But then you can scale this and use a real cluster, with cloud object storage for saving files instead of using your own laptop for file storage. Not only that: as I said, we also have the Kubeflow pipelines for automating the running of this, and quick-start guides all the way. Right after this I want to go to the repo and show you a little bit of its structure and where you find help in the repo.

So I think this is my last slide before I get to the demos. Let me stop sharing this; I want to share my browser next, so that we go to the repo and I show you this. But maybe it's a good time to look at some of the questions. Is there anything that I should be answering in the chat? I guess this is all stuff from you guys in Data Science Dojo, giving information and so on. On the Q&A I noticed, DD: why is deduplication needed to prepare data to train a large language model? Right. So if you don't have deduplication, there's a good chance that you will overfit your model. Based
on everything that you know in AI and so on, you will overfit your model if you have a lot of duplicates. Not only that: it also reduces the data. You cannot imagine how many duplicate files are there when you take data from all these various sources. They all refer to, I don't know, Attention Is All You Need, the main paper that started the whole GenAI and LLM stuff. That paper appears in a thousand places. You do not need to carry it along and then overfit your model based on that information. So I hope I answered that question, Craig. Yeah. Okay, anything else there? Not that I see at the moment.

Let me go and share my browser for a second, because I want to take you guys to the repo itself. Okay, so this is the repo, data-prep-kit, on github.com under IBM; you will find it there. Here is the diagram that I was showing you during my presentation. We have a section called getting started, and one of the demos that I wanted to show you was by clicking on this. If I click on this, it will bring me to Google Colab. Let me actually wait; I should go back to the Data Prep Kit repo itself also, because we may need that at some point. But clicking on this Google Colab link brings you here. The nice thing about Google Colab is of course that you don't need to install anything, as long as you have a Google account. I have a free account, I don't have a paid account on Google Colab, and this allows me to run and try things out without installing anything on my laptop, without cloning the repo, without doing anything, because it pip installs the whole package here. So maybe we should actually
start this, because this is going to take a little while. This is one of the demos that I wanted to show you, so let's go ahead and do this. Actually, maybe what I should do before doing this is change the runtime on Google Colab and use GPUs, so we make it a little faster. Okay, so let's do the pip install again and go through.

While it is doing the pip install of the whole package, let me go back to show you a few more things on the repo itself. It gives you instructions on how to create your virtual environment, assuming that you want to use conda. There's another README if you don't want to use conda and want to do your own virtual environment, venv and so on; there's a way to do that. Then you pip install the main package, the Ray-enabled version. When it has this ray in it, it means that it has the Python one also, so the Ray one is a superset of both Python-only and Ray. It will install this, it will install JupyterLab for you, and it allows you to run your first transform locally on your machine. It will bring you, again, a few examples. I was going to try a couple of these examples, but if you want to know where these examples are on the repo itself, there's a directory, examples and notebooks.

In my charts, when we went from my charts and came to the browser, if you just glanced over it, I had it up for maybe 10 seconds, it said three demos. The first demo was that Google Colab one, which is still pip installing the whole package. The second one was actually a fine-tuning example that would allow you to come here to the code and run this sample
notebook from here. Of course, you would run this after you have cloned the repo, and hopefully we will get to show you that: we will run this particular notebook, and it will allow us to go through one transform after another as relevant to preparation of code. So this was the second demo that I wanted to show you. Here it is; this is part of the... so, the RAG folders are under examples and notebooks.

The other one, which is more interesting and for more advanced users, is the RAG example. The RAG example, again, has multiple notebooks. It has the usual RAG process, which starts from a vector DB, then adding some sort of a prompt; then you do the embedding of the query, and then you pass that to some LLM engine, and it will give you the best result, or the best answer to your prompt, that the original LLM would not have given you. So this is a more appropriate and more customized answer to your question. Everything that is in the green, for you guys who know RAG, is the standard RAG stuff. But what we've added at the top: we are saying that if you do cleanup and pre-process data here, and you use some of our modules for splitting into chunks and vectorizing (these are all specific modules), and if you go through all of this before loading your data into the vector database, the results are going to be better. And then, as I say, this is for you to go through this whole example and try it out, and hopefully there are plenty of instructions there to show you how to do this. But let me go back to... yeah, okay.
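The preprocessing idea described for the RAG example (clean up the text, split it into chunks, vectorize, and only then load into the vector database) can be sketched roughly as follows. This is a minimal illustration in plain Python under stated assumptions: `clean_text`, `split_into_chunks`, `toy_embed`, and `retrieve` are hypothetical names, not Data Prep Kit modules, the "vector database" is just a list, and the hashed bag-of-words vector stands in for a real embedding model.

```python
import hashlib
import math
import re


def clean_text(text):
    """Collapse runs of whitespace; a stand-in for real cleanup steps."""
    return re.sub(r"\s+", " ", text).strip()


def split_into_chunks(text, max_words=50, overlap=10):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Hashed bag-of-words vector, L2-normalized. NOT a real embedding model."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(raw_text, max_words=50, overlap=10):
    """Clean, chunk, and vectorize; the list of pairs plays the vector DB."""
    chunks = split_into_chunks(clean_text(raw_text), max_words, overlap)
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Return the k chunks whose vectors have the highest dot product with the query."""
    q = toy_embed(query)
    scored = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
    return [chunk for chunk, _ in scored[:k]]
```

In a real pipeline the retrieved chunks would be added to the prompt that goes to the LLM; the point of the sketch is only that cleanup and chunking happen before anything is loaded into the index.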
so this this this guy uh um the it ended it pip installed everything and and so we restart the session here and and and now we can we can go and and and create an input data directory and output test data this this one is a very simple stuff which is which is going to download on the Fly um some some some files from archiv one of them is the granite code code models this is the paper that that from IBM uh the it it shows the details of this and then this is the famous attention is all you need paper and so on so it downloads those and and it wants to now go and extract from from this I will I will uh let it let it run uh again uh to to to execute that that particular um PDF to parket transform and we will we will come back so so you you will see it it running here on on in the background on on on Google collab but while it's doing this and and and we will come back to this let let's go back and I want to now go and and and try uh the the one of the examples that I was showing you before uh which is which is based on this find tuning of code and and this particular notebook here let's let's take a look at it before actually I do anything that does may may break and usually live live demos there's a there's a tendency to break but let me describe what what it does before so so again this this this this one which we will be running on on my laptop and and we will be monitoring it on my laptop it's it starts by by pip installing the the a version of the transform and it it does does pp install some some data sets and on Panda and so on then then that in it it uh it Imports the particular trade transform launcher that that that we need we from from from from our our package and and and then it starts working with a local uh directory called sample data so so you will see that it starts from um the PDF files that are are coming they they you we are downloading from hugging face we we are downloading these particular data sets from hugging face and then use our mod module 
to create a Parquet version. I think this example downloads 19 files, 19 datasets, and then converts them to Parquet, so you will see 19 Parquet files generated if I run this notebook live, and hopefully I will do that for you in a minute. Then it goes through... I think at the beginning it may have... yeah, it explains that it reads the dataset from Hugging Face and converts it to the Parquet data format; then it does exact deduplication; then it does fuzzy deduplication; then it uses the programming language selection module; then it uses the code quality annotation, which identifies that this particular file has higher quality than the others, and so on; and then it filters them down. Then it does semantic ordering based on the repos. Again, you will see we have different criteria for ordering the repos: these are the C repos, these are the Java repos, and so on. Finally it tokenizes them and makes them ready to be fed into some engine, whether that engine is for the pre-training stage, with millions of files, or for the smaller-scale number of files needed for fine-tuning. In this particular case it is fine-tuning, so it doesn't do that many files. As I said, in this particular example I think it's 19 files, 19 highly relevant files that are used during the fine-tuning and are labeled and so on, which help with fine-tuning what comes out of the LLM engine.

Okay, so, is this already done? Okay, this should be done in a second or so, and maybe we can come back to this. I don't know; we have maybe less than 10 to 12 minutes left. Whether it's better for me to go and do that notebook that I showed you, run it live, by sharing my terminal screen and
looking at the directories that are created as each transform is run. You start from zero, nothing in the directory; it starts from the so-called sample data, which is the first directory, and then every single transform uses the result of the previous one as its input and creates an output directory. So as you go through this, the number of directories keeps increasing until the end, where the final result is the directory that has the tokenized files in it. Nothing magical about this, but I highly encourage you to try this out and hopefully give us some feedback: are we giving enough instructions in our READMEs, and so on?

I do want to also come back to some of the key differentiators that I brought up in the beginning, one of which is adding your own transform. We are on the repo; let's go back to the first page of the repo and look at some of the help that is there. For example, transform tutorials are here, so you have a step-by-step tutorial on how to add your own transform. You click on this and it will bring you to this. It does it with the so-called no-op transform, a transform that takes the input and whose output is the same as the input; it does no operation on it. But it takes you through all the steps that are needed for real transforms, and all the real transforms follow the same pattern. So taking this tutorial helps you to create your own transform, and of course there are example transforms. And remember that I said, if you want to read... I usually get the question, how does fuzzy dedup work? If you click on the fdedup filter and you get to the README file for fdedup, it says that it's using the MinHash algorithm.
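For readers who want a feel for the MinHash idea mentioned here, the following is a minimal sketch in plain Python. It is not the Data Prep Kit fdedup implementation; the function names, the 3-word shingles, the 128 hash functions, and the 0.7 threshold in the usage note are all assumptions chosen for illustration. Each document is reduced to a short signature, and the fraction of matching signature positions approximates the Jaccard similarity of the documents' shingle sets.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(item, seed):
    """64-bit hash of an item; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=128, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text, k)
    return [min(seeded_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(text_a, text_b, k=3):
    """Exact Jaccard similarity of the two shingle sets, for comparison."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    return len(a & b) / len(a | b)
```

A fuzzy-dedup step could then treat two documents as near-duplicates when `estimated_jaccard` exceeds some threshold, say 0.7. Systems that work at scale usually add locality-sensitive hashing on top of the signatures so that not every pair of documents has to be compared.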
More details are shown here; this is, sort of in simple terms, how it works. And you have the equivalent of this if you go back to the table: for every single one of these, if you want to know how it works, you click on it and it will take you to the appropriate README file. Remember that I said that the table in my presentation didn't have the HAP filter in it yet. Now, and this is the live version, HAP has been added to the list of the transforms, and there's only the Python version of it right now. If you look at the repo and the PRs that are created, you will see a Ray-based version of that same HAP transform; it's now a PR which is being reviewed, and so on. So you can expect that in a day or two, once this review is finished, there will be a check mark on that table in front of the Ray-enabled version of HAP.

Let me stop sharing here. I think I prefer that I stop and don't spend my time on live demos, and hopefully be able to answer questions. I do have one more slide left, which is the summary slide, and maybe I'll just share that and put it on the screen for a second, and then come back, because you will be able to get these charts; hopefully you will have these charts. So if I go and look at my charts, I think the last one after the demo is that high-quality models start with high-quality data, data is challenging, and the solution for all of this is something called Data Prep Kit. As an open-source community project, we want people to contribute, and so on. So that's sort of the summary of everything that I've been saying. Let me stop here and entertain questions, comments, and so on.

Okay, so either
everything that I was saying was so trivial and so obvious to everybody that you don't have questions, or I didn't explain well enough. In any case, you are here, so feel free: you can create issues, you can communicate with us through issues on the repo itself. We have set up a Discord channel; I think we should advertise the Discord channel on the first page of the repo. The Discord channel is also a place for collecting comments and inputs and so on.

Right. Shuk, can you just check your chat box? There are a few questions that you may be answering. Okay, on the Q&A or on the chat? The chat, the general chat, yeah.

Okay, let me... okay, so, a lot of it in the beginning is about, okay, this being recorded. Okay, so: in the context of correctness, the semantic layer on the delivered data, how accurate is the code LLM in interpreting many languages? Yeah, so again, the modules that we have released on this repo itself... are you talking about the natural language, not the programming language, or programming languages? What's on the open source right now, from a natural language, spoken language point of view, is English only, although we have of course used other languages when we were creating the Granite models and so on. In terms of code, of course, 123 programming languages and so on are addressed, recognized, and annotated. And of course, when you use that filter and you use the appropriate programming language down the road for fine-tuning and so on, the better the result you get. I'm hoping that I've answered your question. I wasn't sure whether it is
programming language or spoken language, but in the case of programming language, lots of languages; for spoken language, this repo, the open-source one, is English only at the moment. Yeah, okay. Anything else that I am... someone from Fairfax, okay. Okay, thank you for your comment. Is there anything else that I'm not seeing that you see? I don't see anything else. Yeah, okay. Okay, so you have to, yeah, use the repo and rewatch. Okay, thank you. Yes, I'm sorry, it was just sort of the intro to this, and I highly encourage all of you who are interested to go to the repo and try it out. Try the examples, initially the examples that don't require you to install anything: you can run and try out some of these using the Google Colab that I have, without installing a single thing on your laptop; everything would be taken care of in the browser. Once you do that, then go and clone the repo and, step by step, get the first example, the second example, and the more complex RAG one if you are a more advanced data scientist, depending on your level of expertise. We have things for everybody to try and learn and contribute, and so on.

Right, I think we are done with questions, right? Right. If anyone has any questions, they can definitely ask Shuk right now, or we will definitely be getting the recording of the session as well afterwards, so don't forget to get that. Other than that, thank you so much, Shuk, for a very insightful talk; it was truly great to have you in a session. Before we log off and say goodbye to our audience, I'd just like to show you our community. Feel free to join our Discord community, where you'll get access to the exclusive content that we usually post on Discord as well as on our socials. It's very exciting, since we organize giveaways. Moreover, we have all of our recordings as well; we have
events set up here as well, and the recordings of this session and the previous sessions are all available. So don't forget to join our community; the link is shared with you in the chat box. With that being said, thank you for joining us, thank you to our audience, and I'll see you at the next events. Thank you so much.

Thank you, my pleasure. Thank you, bye-bye.
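The chained-directory pattern described during the demo, where each transform reads the previous transform's output directory and writes a new one, together with the no-op transform used in the tutorial, can be sketched as below. This is a hypothetical illustration, not Data Prep Kit code: the function names and directory naming are made up, and JSON Lines files are used instead of Parquet so that the sketch needs only the Python standard library.

```python
import json
from pathlib import Path


def read_records(path):
    """Read one JSON object per line."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def write_records(path, records):
    """Write one JSON object per line."""
    path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")


def noop(records):
    """No-op transform: the output is the same as the input."""
    return list(records)


def exact_dedup(records):
    """Keep the first record for each distinct 'contents' value (within one file)."""
    seen, kept = set(), []
    for record in records:
        if record["contents"] not in seen:
            seen.add(record["contents"])
            kept.append(record)
    return kept


def run_transform(transform, input_dir, output_dir):
    """Apply a transform to every .jsonl file in input_dir, writing to output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(input_dir.glob("*.jsonl")):
        write_records(output_dir / path.name, transform(read_records(path)))


def run_pipeline(steps, sample_dir, work_dir):
    """Run (name, transform) steps in order; each output directory feeds the next step."""
    current = Path(sample_dir)
    for number, (name, transform) in enumerate(steps, start=1):
        output_dir = Path(work_dir) / f"{number:02d}_{name}_out"
        run_transform(transform, current, output_dir)
        current = output_dir
    return current  # directory holding the final result
```

Calling `run_pipeline([("noop", noop), ("exact_dedup", exact_dedup)], sample_dir, work_dir)` would leave one new directory per step next to the sample data, which matches the growing list of directories described in the talk. A new transform only has to follow the same records-in, records-out shape as `noop`.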

Original Description

In the world of AI, conversations often revolve around models but conclude with data. As the Generative AI landscape evolves, data preparation has become a critical phase in crafting high-performing Large Language Models (LLMs). The success of LLMs hinges on the quality and quantity of the text and code corpora used during their training. The data preparation phase is essential for cleaning, filtering, and transforming datasets into a tokenized form, suitable for either pre-training or fine-tuning LLMs. Key Takeaways: • Discover how DPK fosters collaboration within the AI community. • Learn how DPK can accelerate your development process and reduce time-to-value. • See how DPK has been a driving force behind the IBM open-source Granite models.

The Data Preparation Toolkit is a crucial component in developing high-performing Large Language Models, and this video demonstrates its usage and importance in data preparation, fine-tuning, and retrieval augmented generation. By following the steps and using the toolkit, developers can improve the quality and performance of their LLMs.

Key Takeaways

Download and extract external data sets
Ingest data into a schema
Remove duplicate documents
Perform semantically-based deduplication
Detect language
Annotate documents with language ID module
Apply quality module to give a score between 0 and 1
Filter down to English language documents with quality score > 0.80
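
The steps listed above (remove exact duplicates, annotate each document with a language ID and a quality score between 0 and 1, then keep English documents scoring above 0.80) can be sketched as follows. This is a toy illustration under stated assumptions: `toy_language_id` and `toy_quality_score` are hypothetical stand-ins based on simple word counts, not the toolkit's actual language ID or quality modules, and the field names `contents`, `lang`, and `quality` are made up.

```python
import hashlib

# Tiny stopword list used only by the toy language detector below.
STOPWORDS = {"the", "and", "is", "it", "of", "to", "a", "in"}


def exact_dedup(docs):
    """Drop documents whose contents hash to an already-seen digest."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["contents"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def toy_language_id(text):
    """Label text 'en' if at least 20% of its words are English stopwords."""
    words = text.lower().split()
    if not words:
        return "unknown"
    hits = sum(1 for word in words if word in STOPWORDS)
    return "en" if hits / len(words) >= 0.2 else "unknown"


def toy_quality_score(text):
    """Score in [0, 1]: the share of tokens that are purely alphabetic."""
    words = text.split()
    if not words:
        return 0.0
    return sum(1 for word in words if word.isalpha()) / len(words)


def annotate(docs):
    """Add 'lang' and 'quality' fields without modifying the input records."""
    return [
        dict(doc, lang=toy_language_id(doc["contents"]),
             quality=toy_quality_score(doc["contents"]))
        for doc in docs
    ]


def filter_docs(docs, lang="en", min_quality=0.80):
    """Keep documents in the target language with quality above the threshold."""
    return [d for d in docs if d["lang"] == lang and d["quality"] > min_quality]


def prepare(docs):
    """Dedup, annotate, then filter, in that order."""
    return filter_docs(annotate(exact_dedup(docs)))
```

Annotation and filtering are kept as separate steps so that the threshold can be changed without recomputing the scores.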

💡 High-quality models start with high-quality data, and the Data Preparation Toolkit is an essential tool for LLM application developers to prepare and fine-tune their models.
