Generative Python Transformer p.4 - Tokenizing
Key Takeaways
The video demonstrates tokenization using Byte Pair Encoding (BPE) and the Hugging Face tokenizers library, highlighting its importance in natural language processing and language model training. It showcases the process of training a tokenizer, creating a vocabulary, and using it for tokenization.
Full Transcript
what is going on everybody and welcome to i think part four of the generative python transformers videos in this video what we're gonna be doing is building a tokenizer and so if you're a little unfamiliar with tokenizers we use them with natural language processing to convert strings of text because this is kind of unacceptable the machine doesn't really understand strings of text we're trying to convert that into something the machine does understand like vectors of values right that makes more sense to the machine that's quite a large one let's shrink a couple of these so this is more along the lines of what the machine expects as input so tokenization is all about converting this to something more like this now how we do that there are so many different ways that we can do that from a classic kind of bag of words model where each word just gets a unique id arbitrarily all the way to what we're going to be using today which i think is kind of the latest and greatest which is byte-level byte pair encoding the industry standard was to take subwords and kind of piece those together but even when you do that you would be very surprised how many possible combinations of strings there are and you either need a huge vocabulary you know i don't know like a million okay or you have to deal with unk tokens which are just if it's not contained within the vocab it gets an unk token or an unknown so unk tokens make it hard for the model the model is never going to really learn what an unk token is and then having like i say a vocab size of 1 million that's a very large vector that's also very hard for a model to learn so it seems to me that byte-level byte pair encoding essentially allows you to have a very very small vocab size and yet have complete and total coverage and never have an unk token so that's what we're going to be using
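To make the "small vocab, full coverage, no unk" idea concrete, here is a minimal pure-Python sketch of byte-level byte pair encoding. It is a toy written for illustration, not the Hugging Face implementation: the names train_bpe, merge_pair, encode and decode are hypothetical, and real tokenizers also pre-split text so merges do not cross word boundaries. The point it shows is that the 256 byte values are always in the vocabulary, so any string can be encoded, and learned merges only shorten the sequence.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns {(left_id, right_id): new_id}."""
    # Start from raw UTF-8 bytes, so ids 0..255 are always available.
    seq = list(text.encode("utf-8"))
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # toy stand-in for a minimum-frequency cutoff
            break
        merges[pair] = next_id
        seq = merge_pair(seq, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Encode text as ids: bytes first, then merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts keep insertion order
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(ids, merges):
    """Expand merged ids back into bytes and decode them as UTF-8."""
    expand = {new_id: pair for pair, new_id in merges.items()}

    def to_bytes(token_id):
        if token_id < 256:
            return bytes([token_id])
        left, right = expand[token_id]
        return to_bytes(left) + to_bytes(right)

    return b"".join(to_bytes(i) for i in ids).decode("utf-8")


if __name__ == "__main__":
    corpus = "print('hello world')\nprint('hello there')\n" * 3
    merges = train_bpe(corpus, num_merges=20)
    ids = encode("print('never seen this')", merges)
    print(len(merges), "merges learned")
    print(ids)
    print(decode(ids, merges))
```

Text the toy tokenizer never saw during training still encodes and decodes cleanly, because anything without a learned merge falls back to single bytes instead of an unknown token.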
and for this tutorial or video i'm not really calling this a tutorial i am no expert on any of this right i'm very new to hugging face i'm very new to transformers as well this is me learning so i'm gonna be referencing the following things this is kind of the hello world example from hugging face there's quite a few things that have changed and have been updated here one of the things at least they acknowledge i don't know why they don't update this maybe someone thinks for seo purposes that it's better to keep it fire that person but then there's also a colab that they link to and this has kind of one of the updates we'll talk about it when we get there but even that is outdated so we'll make another fix to that and then finally i've got the docs up for the gpt2 versions because this hello world example is actually using a bert model now i do plan to possibly check out using a bert model at some point i think combining a bert model and gpt2 model like having them both at the same time would make a pretty awesome autocomplete so i kind of want to check them both out but for now we're going to be doing gpt2
so i will be referencing that stuff so if you see me looking over that's what i'm doing so yeah let's go ahead and begin so the first thing that we're going to do is from tokenizers we're going to import that byte level bpe tokenizer so from here what we want to do is let me open up our directory yeah so the tokenizer takes in a cluster of files or at least that's kind of the default if you just happen to have one large file that's fine but it can also take in kind of a big batch of files so yeah we'll start with reading this file in so we'll just say paths and we'll say that equals a list and then we'll just grab the file name here and pass that in but as time goes on you could have probably chunks of files and paths as far as i can tell this will not help you memory wise like it doesn't really matter if this is either one giant file or a cluster of little files the problems that we will hit later on down the road seem to be unimpacted by whether it's one giant one or a bunch of little ones let me put this over here cool so then what we're going to specify is a tokenizer and this part kind of confused me when i first got started here you're basically going to wind up kind of defining a tokenizer twice so first you have this i'm gonna call it a base tokenizer okay you have this base tokenizer that you train first and then you load it into i don't know the model specific tokenizer that you intend to use via hugging face so again i think that's all kind of hugging face api type stuff but at least for me that kind of confused me like why are we defining a tokenizer twice but anyway i just want to raise your attention to that just in case we get to it and i don't talk further about why we're doing that so tokenizer so this is a byte level bpe tokenizer now what we want to do is we want to actually train that tokenizer let me see if i can get a good copy and paste 
possibly from cool so in fact let's just take i'm just literally copying and pasting from this right here i'm going to copy that and paste it here boop so paths that's good 52k that's i think the default for gpt2 min frequency my guess is it has to occur twice that shouldn't actually be necessary i'll leave it there because that's there but this is the beginning of a sentence this is padding so you can't have a variable input to the neural network so for example i think the actual length of gpt2 the context is 1024 so 1024 and we have limited we haven't even tokenized yet because before you tokenize and before you build your data set you have no way of knowing how long will a sequence let's say a sequence is 100 characters how long will that sequence be after it gets tokenized we don't know we can't know until we have a tokenizer and i don't know that i want to truncate stuff like ideally we would have samples that aren't truncated because there might be a really important piece of information that got truncated so i would rather not do that so anyway i'm just trying to find out and so right now we're using 512 but eventually we'll probably increase that but anyways that's what the padding is for because you might have some samples that are 400 tokens long and then some samples are 500 tokens long they're all going to be passed in as a 1024 context so you're gonna have to pad the rest of the tokens unk hopefully we don't need the unk token and then the mask token we purely need that for later when we go to generate sequences the generator is going to try to fill in the mask token which in our case will just be at the end of the sequence more on that later just know these are just kind of some starting special tokens so then save when did we train it oh it's right here i was like wait we're saving
the model we're not even training it so that's funny so what we'll do is we'll just save it to a folder called tokenizer and i don't know if it's going to make it so we'll just go ahead and make it real quick tokenizer cool so let me open up a terminal now and let us python 3 tokenizer.py and yeah let's go ahead and run that see what happens so it goes pretty quick this is a small file so it's not too surprising and what we should have if we did save it i hope tokenizer yes so now you have two files in here you've got merges.txt and then you've got vocab.json so merges.txt i think is more of your byte level information what we want that we can actually kind of look at and read is here so these are your tokens and then their respective id now i think this will be ordered by occurrence that's the word i was looking for but i'm not 100 percent on that i could be wrong so anyway here we have like for example support constructed and then this is its id did we go all the way to the end yeah we did let's pick something a little more common apparently this you know group of letters for example l-i-e-s this is not lies alone this could be part of something that says implies or something like that right these are just pieces of words and i forget what these like g characters someone can comment below it signifies maybe start of sentence or you know precedes a space or follows a space rather i'm trying to think of what else anyway i think it has something to do with that but anyway someone smarter than me can comment below what those little funky g characters are doing anyway yeah so you can kind of look through here and in fact one thing that might even help is if we turned on word wrap am i blind it is under view there it is boop okay so now we can see all of our characters so yeah my guess is maybe it's not in order of occurrence because i really doubt this is a super 
common one so anyway i don't know oh yeah maybe the g is like a space one two three four five yeah maybe that's what it is maybe it's a space character i don't know like i said someone feel free to educate me if i'm seeing such weird ones like this then i think the g is like some sort of space so it's like you know an amount of indentation essentially anyway cool so we have a tokenizer now you might be thinking oh we're done but we are not so interestingly enough coming back to reference our tutorial here and also i think one thing we'll do is i'm going to make a constant here train i don't know train base and we'll set this now to be false so we only need to train this one time we don't need to do this every single time so if train base we'll leave paths actually tokenizer i think this will be good like that cool so we don't need to waste time training that every single time now we can test this tokenizer so we can just make up some input real quick so input will be print oh we can't use the double quotes let's just use single quotes for now hello world and then what is it tokenizer input and then the input so we'll say t equals tokenizer dot tokenize if we run this again we need to load it i forget what the syntax to load it is but we'll get there in a second tokenizer dot not tokenize encode input and then i think it's ids and tokens yes so we'll print t.ids and we'll print t.tokens now the ids will be the actual numbers i guess we'll call it and then the tokens this will be the corresponding subword i guess i'll call it and we are almost ready but what we actually want to do is hmm i kind of want to load the i don't know i'm not really sure what i want to do because we could load i forget if it's like load or load model but then we're right about to also what i lost my hugging face docs here we go save do they load it back i just forget what the 
syntax is after you've saved it yeah okay we're going to train it one more time just so you can see these values and then we'll just stop printing them out i suppose because we're about to make the gpt2 tokenizer and i don't feel like adding a bunch of logic and then not using it so again these are your ids so these are just the numerical values so print you know print hello world became this vector right here and then this is just showing you the chunks so hello world interestingly enough we actually did not get any subwords it just learned to split out certain elements but also interestingly for example an opening parenthesis with a single quote that is its own unique token so if we change this to this we should see that it actually tokenizes it differently hello world and then we'll come over here we'll rerun that right so you know hello and world wound up with the exact same ids and same thing with like the exclamation point but the things that differ are these tokens here right opening parenthesis and then double quote that gets a different id but i promise there will also be subwords i can't think of a really good one maybe some of those like really long exceptions that might be one you know like an import is probably going to be its own token anyway so we now have this trained tokenizer that is saved and i'm actually just well we'll leave input but i'm going to delete this other stuff so once we have this base tokenizer the next thing that we need is the tokenizer that actually corresponds to the model that we're going to use it with so i guess we'll come up to the tippity top and what we'll do here is from transformers we're going to import gpt2 config gpt2 it'll be a miracle if i don't typo any of this lm for language model head model and gpt2 tokenizer so again this is kind of the part 
that confused me and i could be wrong but i don't think you can just load gpt2 tokenizer like this and do like i don't think you can actually do what just happened i don't think it works that way in effect so if we go python 3 tokenizer.py missing two right yeah so when you attempt to initialize this object it's not going to work because it expects to have that vocab file already so yeah i thought so so coming back over here so what we do is we first train this byte level bpe encoder and then once we have that we can load the results of that in via that gpt2 tokenizer so to do that what we will do is once again specify tokenizer equals and this will be gpt2 tokenizer dot from underscore pretrained and it will be tokenizer tokenizer from pretrained i almost wonder is there a i just i don't think so init missing two like you could not just you can't even initialize it without a vocab file and a merges file but then you have to say dot from pretrained i wonder if it would work if you didn't even say from pretrained i don't know anyway whatever this is what the tutorials say to do so we're going to do it this way anyway so this just tells it the directory from where to load the tokenizer now i also found that these like so like let's pull back up oh vocab that's bad unk mask so we already have these but you still have to let the model know so once we've loaded in that tokenizer i'm going to copy this real quick we're not typing that out and paste so we still need to add special tokens and just like notify the gpt2 tokenizer say that really quickly we have to still tell it this is the end of sentence token this is the beginning of sentence this is the unk pad mask so we still have to inform it hey these are the ones that we're 
going to use and essentially what this means is you could actually make these anything you want you don't have to use these ones now i will say i tried to make like my own new line token and i couldn't figure it out i couldn't figure out how to get it to just recognize like i want this new line token i want that to be a token i could not get it to work so there's that i did try to add one but it seems like they just have their own special list of you know these special tokens and you can only pick those you can't make more so anyway if somebody knows how to truly add a token like a new line character as a token let me know so after we've done that we can redo the tokenization now so we can say t equals tokenizer and again it'll be dot encode the input and then in this case we'll print t and this one's a little different so in this case when we print t we don't have to say t.ids dot encode actually just encodes it so hopefully we did this right oh we're probably going to retrain let me change that to false while i'm thinking about it so we don't need to do anything like that again but anyway when it was done you can see here it encoded them and the ids are the same as what we had before okay so now we have our gpt2 tokenizer and i think that's basically everything i really want to show i guess the only other thing is now that we have the gpt2 tokenizer take note that we are going to encode stuff that goes into the model and then the model is going to spit out its decode so the decode is also going to be something like this it'll be a vector of values or maybe it'll be like one more new value that gets appended or you know who knows it kind of depends on what your task is but the only other task that you're gonna do with your tokenizer is you might also decode so we could also print tokenizer.decode t and this should just bring us back to where we 
started right so one thing to take note of is it did not do something like this where it was like this vector or list of substrings that's not what it did when it decoded it actually just rebuilt the sentence back together so once you pass things through the model you just decode them and then you could have you know this beautifully formatted code so to speak we'll still have to fix the new line character thing but other than that that's all we'll really have to do so from there i think i will save the next step for the next tutorial and that will be actually training the model so the next thing that we have to do once we have the tokenizer that's our kind of final dare i call it a pre-processing step i think it's fair to say that's the final pre-processing step before we actually start feeding data into our model so essentially what we're going to want to do is load in this very small file into our model in an attempt to train something from it again i won't actually train i'll see if this model generates anything for us this code mostly because i'd like to at least host a file big enough that you could train something from it but then i will actually be training a model on the dgx station a full-size model on all of the code that i have which is about 100 gigs of python code and i suspect that model could turn out to be a pretty good model so i'm excited to see that i don't think it'll be a very good model from this data set but who knows we'll see so yeah if you guys have any questions comments concerns whatever you know the deal feel free to leave them below this video is sponsored by neural networks from scratch i don't know if anybody's ever heard of neural networks from scratch but you can learn more about it from nnfs.io if you want to learn how to build neural networks from scratch in python purely just to 
learn more about how they work the end goal is not actually to build your own framework but that is basically what we do but the idea is to learn more about how to use the popular libraries like keras like pytorch and actually understand you know what's going on there just to give you a better understanding and the way that we're doing that is by building neural networks truly from scratch in python so anyway if that's something that sounds interesting to you you can go to nnfs.io and check it out otherwise i will see you guys in another video
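On the open question in the transcript about the "g characters" in vocab.json: in GPT-2 style byte-level BPE, every byte is first mapped to a printable unicode character so tokens can be stored as plain text, and the space byte ends up as Ġ (U+0120), which matches the guess made in the video. The sketch below re-creates that byte-to-character table in plain Python, following the mapping published with GPT-2's reference encoder; to my knowledge the Hugging Face byte-level tokenizer uses the same table, but treat that as an assumption. The helper names to_visible and from_visible are hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that already print cleanly keep their own character:
    # '!'..'~', then U+00A1..U+00AC, then U+00AE..U+00FF.
    keep = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Whitespace and control bytes are shifted to U+0100 and above.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_visible(text):
    """Show text the way it appears inside a byte-level vocab file."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_visible(visible):
    """Invert to_visible, recovering the original text."""
    return bytes(CHAR_TO_BYTE[c] for c in visible).decode("utf-8")


if __name__ == "__main__":
    line = "    return lies"
    shown = to_visible(line)
    print(shown)  # ĠĠĠĠreturnĠlies
    print(from_visible(shown) == line)  # True
```

Under this table a run of Ġ characters in a token is a run of spaces, which is why indentation-heavy Python code produces tokens that are mostly Ġ, and a newline byte shows up as Ċ (U+010A).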
Original Description
Neural Networks from Scratch book: https://nnfs.io
Channel membership: https://www.youtube.com/channel/UCfzlCWGWYyIQ0aLC5w48gBQ/join
Discord: https://discord.gg/sentdex
Reddit: https://www.reddit.com/r/sentdex/
Support the content: https://pythonprogramming.net/support-donate/
Twitter: https://twitter.com/sentdex
Instagram: https://instagram.com/sentdex
Facebook: https://www.facebook.com/pythonprogramming.net/
Twitch: https://www.twitch.tv/sentdex