Building your own AI text generation tool with aitextgen using GPT-2/GPT-3

Imaad Mohamed Khan · Intermediate · 📰 AI News & Updates · 4y ago

Key Takeaways

This video demonstrates building an AI text generation tool using aitextgen with GPT-2/GPT-3 models, covering installation, text generation, fine-tuning, and model training on news headline datasets.

Full Transcript

Hello everyone, welcome to yet another video. My name is Imaad, and in today's video we will take a look at aitextgen, a Python package written by Max Woolf, who's a data scientist at BuzzFeed. This is a robust Python tool for text-based AI training and generation using GPT-2 and GPT-3, so you can use either of the models and create your own text-based AI generation tool. Sounds really fun, sounds really fascinating. In this video I will take you through how you can start using this package right away. There is a README that you can definitely go and check out; I have taken some of these things and tried to show you how you can start creating your own text generation tool. I would perhaps also need to open aitextgen.py later on when I go through some of the parameters, but it's good for now. Let me quickly get started with the demo.

The first step with any Python package is to install the package. In this case, as suggested in the README, we will use the pip command to install it. If you're working in a project you might want to do it on the command line, but I will show this entire demo on Google Colab. In fact, it's going to be two Google Colabs: in one I have already run some training, and I will perhaps switch to that later. For now I will show you over here. This is a brand new Colab, there's nothing run on this. I'm going to go ahead and run this, and let's install aitextgen first.

Behind the scenes, as mentioned in the README, it leverages PyTorch, Hugging Face Transformers and pytorch-lightning, with specific optimizations for text generation using GPT-2, plus many added features. Max Woolf has other text generation tools like textgenrnn and gpt-2-simple, and this is supposed to be the next level of, or rather an improvement over, the other two tools. And basically
what you can do is fine-tune on the pre-trained 124M, 355M and 774M parameter GPT-2 models from OpenAI, or you could use the models from EleutherAI, GPT Neo/GPT-3. Or you could also use the pre-trained models themselves, which is what we will initially do in the demo.

I hope the installation is done already. Okay, there's an error: "pip's dependency resolver does not currently take into account..." and "you must restart the runtime in order to use newly installed versions." Let's restart the runtime. There's no local variable yet, so no problem if I restart. Did I restart? I did restart. Okay, I think it must have restarted, and now let me run. All right, this time it has run successfully and we have managed to install aitextgen.

Once the installation is done, we can start using the package. It's as simple as that. We are going to use the GPT-2 model for the purpose of this demo. We're going to create an instance of the aitextgen class that you will see here; that's why I had opened it earlier. So this is class aitextgen, and we're going to create an instance of it and pass different parameters. In this case we're going to pass the GPT-2 model parameter, and we're going to reference this instance going forward. Let me quickly run that to create an instance of the aitextgen class, which will download the smallest GPT-2 model locally.

Okay, I have ai now, and I'm going to use this ai instance to generate. There is this generate function that is used often. The reason I'm here is to show you the parameters that you can set; specifically, we will take a look at the docstring. n is the number of texts to generate, and we say 2 in this case. prompt is the text to force the generated text to start with, so if I give a prompt, the output will always start with that prompt. max_length is the maximum length for the
generated text. temperature determines the creativity of the generated text. do_sample samples the text, which is what we want; if False, the generated text will be the optimal prediction at each time and therefore deterministic. return_as_list is a boolean which determines whether the text should be returned as a list; if it is False, the generated text will be printed to the console. I think in this case we will want to print to the console, but if you are, say, implementing an API, you might want to save it in a list. seed is a numeric seed which sets all randomness. So these are the different parameters you can play with.

In this case I'm going to set n equal to 2, prompt to "Indian government", max_length to 350, temperature to 0.7, do_sample to True, which is what we want, return_as_list True or False, in this case let's just print it, and seed 42. This should already be it: it should automatically start generating text that starts with "Indian government".

Now what if you don't want a prompt? I will quickly show you that as well. You want the AI system to generate without any prompt provided, so we can also run that. Let's just wait for the other cell to run and then we can run this. Basically I don't give any prompt, and then it randomly picks something and generates it.

Okay, so this is what we get. We get two texts with "Indian government" as the prompt: "Indian government has not been able to convince the world that it has any interest in doing what the United States is doing as a partner and can lead the world in providing clean energy to meet the needs of a growing country. So what's the problem? I'm not talking about the fact that there is no single global leader in the United States is not one of them. I'm talking about the fact that the United States has made very significant strides to the last decade energy independence of cutting emissions in nearly half." And this is so coherent that you could easily mistake this to be written
by a human. So that is one paragraph, maximum 350. Another paragraph is: "Indian government has asked the center to take a look at the issue. It is an issue that is not easily addressed as it is not a matter of debate, it is a matter of policy. But the government is not acting on it. We should not be doing that. A petition is filing is the central government to have it removed from the list of organizations linked to Kashmiris. The petition, which has been filed against the central government in the Lok Sabha, asks the center to take a look at it issue. It is an issue that is not easily addressed as it is not a matter of debate, it is a matter of policy."

Now you see that the model sometimes goes into this uncanny valley where it starts repeating things, and this has been the case with some text generation models. The first example we saw was quite coherent, without any repetition. In this case, this is not text that you can use directly as it is. So what can you do? That's one of the drawbacks of some of the older text generation models, where you had the problem of the model getting stuck in a loop and repeating its output. This has been solved to some extent in further iterations of the models, but it is a known problem, and usually what you do is change the random seed and look for another generation with the same prompt.

Let me quickly show you the other generation, without any prompt. This one will run quickly. This can be anything that the model wants to randomly generate. There's no prompt, so it might not even be about the same thing. Again I'm keeping n equal to 2 because I want two examples. So this is already it: you can use the pre-trained model and start generating the text that you want, with a prompt, and we'll now see it without a prompt. You can also vary the temperature; I will not do that
right now, but you can move it up and down. Okay, there's an output: "For me the hardest thing was trying to understand why I was not doing what I was doing at all. There's no one else who would think I was doing it. The question was posed to him by his mother was pregnant with his son before he was born, he said." So there's sort of a dual going on here. And then the second text is: "Categories category select category aim and music history drama dancing drama dancer dancing dancer dance performance". I think these are different categories that you would select on a screen, on an Oscar show I guess. Again we are seeing the same problem in this case that we saw in the previous example, but now you see two different and completely random texts.

Okay, so that is all about using pre-trained GPT-2 to generate text with and without a prompt. But what if this is not enough, and you already have a style, a way in which you want your model to generate text? That is what we will see in the next part, which is basically fine-tuning GPT-2 in this case, but you can also use GPT-3 or any other model that you have, on your dataset, and retraining it.

In this case, what I did was go to Kaggle and get the Indian news headlines dataset. This is what it is, and that's why I've kept this tab open. There are 20 years of headlines focusing on India. I think I can show you a little bit of the data here. Yeah, headline text: you have "Status quo will not be disturbed at Ayodhya, says Vajpayee", "Fissures in Hurriyat over Pak visit", "America's unwanted heading for India", "For bigwigs, it is destination Goa", et cetera. So, all these different news headlines. That is what we will train our model on: we are going to train the style of generation to be somewhat similar to a news headline. Sounds fascinating, right? So far, whatever generation we've
seen is not in that format, but you will see that we will be able to train our model to start generating text in that format.

What I did was download the CSV from there, upload it to my Google Drive, and then I'm reading that CSV here. Let me go ahead and quickly read that. File not found; that's perhaps because I have not mounted the drive. Or have I? I need to mount it, yes: connect to Google Drive. Now you can see the drive here. I'm not going to open the contents because it's my drive. Okay, I think that issue should be gone, and yes, it's gone, because I have the CSV already in my drive.

So we have publish date, headline category and headline text, three different columns, and we have over 3.4 million records. We are going to take a look at how the data looks. We already saw it on Kaggle, but it's the same again inside this notebook after I have read it. What's most important to us is not the publish date or the headline category but the headline text, because we want our system to generate text in this format. And just to see how many different headlines there are: about 3.1 million, which is not a bad number at all.

I will not run this, or maybe I can run this, that's fine. What I've done now is save a text file with only the headline text, because the fine-tuning expects your training data in the form of a text file. I've just taken the headline text out and put it in a text file.

Now we will start looking at the training process. aitextgen already has TokenDataset, tokenizers, a set of utils, and of course aitextgen itself, which we've seen already. So we import all of these different things, and I will quickly take you through it line by line. This is an example that is already there in the README of aitextgen, so it's not something that I've
written myself, but I will take you through it with the data that I got from Kaggle. The path to my headlines text file is stored in the variable file_name. Then I pass this to train_tokenizer, which will train a custom BPE tokenizer on the downloaded text. Once that is done, it will save a file, aitextgen.tokenizer.json, which contains the information needed to rebuild your tokenizer. So this is the name of the tokenizer file that will be stored after train_tokenizer finishes. The central idea is that your text comes in, and a tokenizer creates tokens which can be used by the model to learn. Your text cannot be used in raw format, so that's why you need to tokenize it and then use that.

There's also a config, in this case GPT2ConfigCPU, which is a mini variant of GPT-2 optimized for CPU training. In this case, for example, the number of input tokens is 64 versus, say, 1024 for base GPT-2.
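To make the tokenization idea concrete, here is a minimal sketch of how byte pair encoding (BPE) learns merges: it repeatedly finds the most frequent adjacent pair of symbols and fuses it into a new token. This is a simplified character-level illustration, not aitextgen's actual tokenizer (GPT-2 style tokenizers work on bytes and handle many more details); the function names and the toy corpus are made up for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair across all words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Toy corpus (hypothetical): after two merges, "low" becomes a single token.
merges, words = learn_bpe("low low low lower", num_merges=2)
print(merges)
print(words)
```

Each learned merge becomes a vocabulary entry, which is why a tokenizer trained on headlines ends up with tokens that suit headline text.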
So we are trying to make this run faster. If you just call this and assign it to config, you will be using this to fine-tune your model. Finally, once you have your tokenizer file and your config, you pass them to aitextgen, which you've seen earlier, as two different parameters, and then you instantiate ai.

Then you also want to build your dataset for training by creating a TokenDataset. For this you pass three different parameters: file_name, tokenizer_file and block_size. All three go as input to TokenDataset, and that gives you your data. Let me just quickly run this now and then talk more; I should have actually run that earlier, but that's okay.

Then you take this data and pass it to your ai.train function. You set batch_size to 8; num_steps is the number of steps you want to train the model for; and then there are generate_every and save_every. What happens is you start training the model for 25,000 steps, and periodically you want to save pytorch_model.bin, so that's why you say save the pytorch_model.bin file every 5,000 steps, and after completion it is saved to the trained_model folder. That is what is going to happen in this step.

I have actually done this for 25,000 steps in another Google Colab. In this case I will train the model for you, but I will not do it for 25,000 steps because it will take around 45 minutes and we don't have that time here. So I can change this to just 100, and I want pytorch_model.bin to be generated after 100, and save_every equal to 100 as well, because this is just a demo. What do you expect with this? Because the number of steps is small, you expect the quality of the fine
tuning to be bad. You will perhaps, and not perhaps, I think definitely, see a lot of gibberish. That is not very encouraging, but this gets better as you increase the number of steps. So we will run it for 100 steps and we will see gibberish, but then I will also switch to the other Colab and we will see some interesting results, and then we will finish this video.

While we are getting the data ready, I can quickly show you TokenDataset. I don't think it's a part of this file. Yep, TokenDataset is separately over here. Class TokenDataset is the class that merges TextDataset and LineByLineTextDataset from run_language_modeling.py. What are the parameters that it takes? It takes file_path, which is a string indicating the relative file path of the text to be tokenized, or the cached dataset. In this case we've given file_name as the file. Then tokenizer_file: it expects a tokenizer file as well. And block_size is also there; 1024 is the default, but like we saw earlier we are using a different GPT-2 config, so we are changing that to 64.

I think we are about 60 percent into building the dataset. This is also the reason why I didn't want to show you the training for a larger number of steps, because it will take time. This is also taking time because of the size of the dataset that we have. Another way to speed this up would have been to reduce the size of the dataset; like we saw earlier, I have about 3.1 million headline texts. So yes, we wait.

Another interesting thing that I have not shown examples of, but can also talk about, is the different functions in aitextgen.
So there's another function called generate_one, which is a pretty cool function. Instead of using generate, you can use generate_one, which will generate a single text. There's no n equal to 2 or n equal to 5 or anything; you just say generate_one and you get one text returned from this function. Like the function docstring says, it generates a single text and returns it as a string, useful for returning a generated text within an API. Essentially, the way it has been implemented is that it calls generate with n equal to 1 and return_as_list equal to True, and then it takes the element at index zero as the text you would like to use. So if you just want one text and nothing else, generate_one is what you would use.

There are other generate functions: generate_samples and generate_to_file. generate_to_file will generate a bulk amount of text to a file, in a format that is good for manually inspecting and curating the text. generate_samples will print multiple samples to the console at specified temperatures, so that you can see different samples and decide which temperature works well for you. So there are a lot of different functions in aitextgen that you could explore; I have just shown you the generic generate function.

And I think we have the dataset ready and we are ready to train the model. The number of steps is 100; let's hope it gets over quickly. Like I said, please don't expect the model to perform well; I'm expecting a lot of gibberish. What you need to do to improve the performance is just change the number of steps. The default that the author has shown in the example is fifty thousand; the one that I have trained is two thousand... sorry, five thousand. Okay, and that's it, so you have the model ready. Saving model... oh, I think, yeah, so
I've generated, and now I can read it from here and start generating with this prompt, but like I said, it's all gibberish. What I'm doing now is again creating an instance of the aitextgen class with the model folder that I want to read from. trained_model is where, like I said, the model is stored. You will have two files, config.json and pytorch_model.bin, the JSON file and the binary file. You also provide the tokenizer as input to aitextgen, and then your instance is ready. Then you use the generate function with all the different parameters and start generating.

Now, even though this is gibberish, you might have noticed that the format of the output is already short sentences, rather than the longer sentences we saw earlier.

Now I will show you what I had generated earlier, and this is why I didn't want to train: I had trained this earlier with the number of steps as 25,000. So compare 100 and 25,000; you will see a difference here. I was generating and saving after every 5,000 steps. I'm not going to run this again because I think I've overwritten the previous model, but I'm just going to show you this. It's the same example with the prompt "Indian government", and you see: "Indian government 91 percent in center satellite now house create need congress to give handle light PM Modi's wants to grab now". Most of these sentences don't make sense. "Indian government engineering construction Trichy boy industrial Cisco privatization center dirty to appear for Bihar one". In fact some of these words also don't make sense, but you already see that it's getting the words right, it's getting some of the meaning right. "PM" and "Modi" go together; it understands that in this case. And "Trichy boy industrial", so it's
starting to make sense. I would have ideally wanted to run this for more steps, but that would need more time and more compute. I mean, I had the compute, I didn't have the time to run this and check the output. So this is what I have right now, and this is clearly short sentences, and clearly very different from what we saw here with the same prompt but not fine-tuned. It's the same prompt, but I've not fine-tuned this; it's the default GPT-2 model that's being used. And this one has been fine-tuned on the Indian news headlines dataset. There is a difference in the style, a difference in the vocabulary, a difference in the way the text is being generated.

Of course the quality is not that great. The quality, I am sure, would improve with more iterations and more experiments with the temperature, and perhaps also with the base model that I'm fine-tuning. This is the one with the fewest parameters, but if I had a model with more parameters, then perhaps I would be able to fine-tune better. So those could be areas that you could work on to get a better quality model. But I think this is already a good place to get started and start working towards building your own GPT-2 or even GPT-3 text generation tool.

So now you can go ahead and build your own GPT-2 and GPT-3 text generation tool using just a few lines of code, thanks to Max Woolf's aitextgen tool. That's all that I have for today. Thank you so much for watching. I hope this video is useful, and I hope you go on to make some of these text generation tools. Please do like, share and comment on this video if you found it interesting, and please don't forget to subscribe to the channel, because your subscription will motivate me to keep creating more such videos. Until I see you in the next video, have a good time, and until next time.
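The training walkthrough in the transcript follows the example in the aitextgen README. The sketch below collects those steps into one function, assuming an aitextgen 0.5.x style API; the function name `train_headline_model` and the file name `headlines.txt` are hypothetical, and the third-party imports are placed inside the function so that the definition itself loads without the package installed.

```python
def train_headline_model(file_name, num_steps=100):
    """Train a small CPU-friendly GPT-2 variant on a plain-text file."""
    # Third-party imports (require `pip install aitextgen`).
    from aitextgen import aitextgen
    from aitextgen.TokenDataset import TokenDataset
    from aitextgen.tokenizers import train_tokenizer
    from aitextgen.utils import GPT2ConfigCPU

    # Train a custom BPE tokenizer; it is saved as aitextgen.tokenizer.json.
    train_tokenizer(file_name)
    tokenizer_file = "aitextgen.tokenizer.json"

    # Mini GPT-2 config with 64 input tokens instead of 1024.
    config = GPT2ConfigCPU()
    ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

    # block_size matches the 64-token context of the small config.
    data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

    # The video uses 25,000 steps with save_every=5000; 100 is demo-sized.
    ai.train(
        data,
        batch_size=8,
        num_steps=num_steps,
        generate_every=num_steps,
        save_every=num_steps,
    )

    # Reload from the folder the trainer writes to by default.
    return aitextgen(model_folder="trained_model", tokenizer_file=tokenizer_file)


if __name__ == "__main__":
    # "headlines.txt" is a placeholder for a file with one headline per line.
    model = train_headline_model("headlines.txt", num_steps=100)
    model.generate(n=2, prompt="Indian government", seed=42)
```

With only 100 steps the output is expected to be mostly gibberish, as the transcript notes; raising `num_steps` is the first thing to try.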

Original Description

Have you ever wanted to build your own AI generation tool? If yes, then this video is for you! In this video, I take you through aitextgen, a robust Python tool for text based AI training and generation using GPT-2 or GPT-3 (EleutherAI's open sourced GPT-Neo version which aims to reproduce OpenAI's GPT-3). In the video, I take you through using the pre-trained GPT-2 model to generate texts with your own prompts. We also go through using a custom dataset (Indian news headlines dataset from Kaggle) to finetune a model to generate text as per your custom dataset. If you found this video useful, please do like, share and subscribe to the channel!
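As a rough sketch of the pre-trained generation step described above, the following uses the parameter values mentioned in the video (n=2, max_length=350, temperature=0.7, seed=42). It assumes an aitextgen 0.5.x style API; the helper names `generation_settings` and `generate_texts` are made up for illustration. Unlike the video, which prints to the console, this version sets `return_as_list=True` so the texts can be reused.

```python
def generation_settings(prompt="", n=2, seed=42):
    """Collect the generate() arguments used in the walkthrough."""
    return {
        "n": n,                  # number of texts to generate
        "prompt": prompt,        # generated text starts with this string
        "max_length": 350,       # maximum length of each generated text
        "temperature": 0.7,      # higher values give more varied output
        "do_sample": True,       # sample instead of always taking the top token
        "return_as_list": True,  # return texts rather than printing them
        "seed": seed,            # fixes randomness for reproducible output
    }


def generate_texts(prompt="", n=2, seed=42):
    """Generate texts with the default pre-trained GPT-2 model."""
    # Third-party import (requires `pip install aitextgen`).
    from aitextgen import aitextgen

    ai = aitextgen()  # loads the smallest GPT-2 model by default
    return ai.generate(**generation_settings(prompt, n, seed))


if __name__ == "__main__":
    for text in generate_texts("Indian government"):
        print(text)
        print("-" * 40)
```

If an output gets stuck repeating itself, calling `generate_texts` again with a different `seed` is the workaround the video suggests.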

This video teaches viewers how to build a custom AI text generation tool using aitextgen and fine-tune it on a specific dataset, such as news headlines. The pre-trained GPT-2 model can produce coherent text, though some outputs fall into repetition, which the video works around by changing the random seed. Viewers learn how to install aitextgen, train a custom model, and adjust generation parameters such as the prompt, temperature and seed.
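Preparing the training data means reducing the headlines CSV to a plain-text file with one headline per line. Here is a minimal sketch using only the standard library `csv` module (the video does this inside a notebook and does not show the exact code); the column name `headline_text` follows the columns named in the video, and the function name and file paths are hypothetical.

```python
import csv


def headlines_to_text(csv_path, txt_path, column="headline_text"):
    """Write one non-empty headline per line; return how many were written."""
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            headline = (row.get(column) or "").strip()
            if headline:  # skip rows with a missing or blank headline
                dst.write(headline + "\n")
                count += 1
    return count


if __name__ == "__main__":
    # Placeholder paths: point these at the downloaded Kaggle CSV.
    written = headlines_to_text("india-news-headlines.csv", "headlines.txt")
    print(f"Wrote {written} headlines")
```

Writing only a subset of the rows is an easy way to shorten dataset building and training for a demo.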

Key Takeaways
  1. Install aitextgen using the pip command
  2. Create an instance of the aitextgen class with the GPT-2 model
  3. Generate text using the generate function with parameters such as number of texts, prompt, and maximum length
  4. Train a custom BPE tokenizer on the downloaded text
  5. Fine-tune the model with a batch size of 8 and 25,000 steps
  6. Use aitextgen to generate text at a specified temperature
💡 The video highlights the importance of fine-tuning a pre-trained language model on a specific dataset to achieve desired results, and demonstrates how to use aitextgen to build a custom text generation tool.
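To illustrate what the temperature parameter does, here is a small standard-library sketch of temperature-scaled sampling: scores are divided by the temperature before the softmax, so low temperatures concentrate probability on the top token and high temperatures flatten the distribution. The tokens and scores are invented for the example, and this is a conceptual illustration rather than the code aitextgen runs internally.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature=0.7, seed=None):
    """Sample one token; a fixed seed makes the choice reproducible."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical next-token candidates and scores.
tokens = ["government", "cricket", "monsoon"]
logits = [2.0, 1.0, 0.0]

for t in (0.5, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

print(sample_token(tokens, logits, temperature=0.7, seed=42))
```

This is also why changing the seed gives a different text for the same prompt: the probabilities stay the same, but a different random draw is taken at each step.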
