Build a Custom Transformer Tokenizer - Transformers From Scratch #2

James Briggs · Intermediate ·🧠 Large Language Models ·5y ago

Key Takeaways

The video demonstrates building a custom transformer tokenizer using Hugging Face's Tokenizers package and training it on the Italian subset of the OSCAR dataset, with a focus on creating a custom tokenizer for less common languages. It utilizes tools such as pathlib, glob, tokenizers, and RoBERTa, and covers concepts including tokenizer construction, text processing, and natural language processing.

Full Transcript

Hi, welcome to the video. We're going to have a look at how we can build our own tokenizer in Transformers from scratch. This is the second video in our Transformers From Scratch series, and what we're going to be covering is the actual tokenizer itself. We've already got our data, so we can cross that off, and now we move on to the tokenizer. Let's move over to our code.

In the previous video we created all these files here. These are just a lot of text files that contain the Italian subset of the OSCAR dataset. Let's maybe open one (ignore that), and we get all this Italian. Each sample in this text file is separated by a newline character. So let's go ahead and begin using that data to build our tokenizer.

We first want to get a list of all the paths to our files, so we are going to be using pathlib. You could also use os.listdir as well. So, from pathlib import Path. I'm using this one because I've noticed that people are using it a lot at the moment for machine learning stuff. I'm not sure why you would use it over os.listdir, but it's what people are using, so let's give it a go and see how it is. We have this, and we just want to create a string from each path object that we get, so for x in, and then in here we need to write Path. In here we basically tell it where to look; we're in the same directory, so we don't really need to do anything here. Then at the end we use glob (I think this is why people are using this), and we just create a wildcard: we want all text files in this directory, so we write that. Let's look at the first five, and we see that we have our text files. So that's good, and what we can now do is move on to actually training the tokenizer.

The tokenizer that we're going to be using is a byte-level byte-pair encoding tokenizer, or BPE tokenizer, and essentially what that means is that it's going to break down our text into bytes. With most tokenizers that you've probably used (unless you've used this one before), we tend to have unknown tokens. For BERT, for example, we use WordPiece encodings, and we have to have this unknown token for when we don't have a token for a specific word, like some new word. With the byte-level BPE tokenizer we are breaking things down into bytes, so essentially we don't actually need an unknown token anymore. That's pretty cool, I think.

To use that we need to do from tokenizers (this is another Hugging Face package, so you might need to install it with pip install tokenizers) import ByteLevelBPETokenizer, like that. Now we take that and initialize our tokenizer, so we just write that. That's our tokenizer initialized; we haven't trained it yet, so let's train it. We need to write tokenizer.train, and in here we need to include the files that we're training on. This is why we have that paths variable up here: it is just a list of all the text files that we created, in which each sample is separated by a newline character.

Now the vocab size. We're going to be using a RoBERTa model here, and I think the typical RoBERTa vocab size is 50k. You can use that if you want, it's up to you, but I'm going to stick with the typical BERT size, just because I don't think we need that much. We're just figuring things out here, so this is going to mean less training time, and that's a good thing in my opinion. We then set the min frequency. This is saying what is the minimum number of times you want to see a word, or a part of a word, or a byte (it's kind of weird with this tokenizer) before you add it into the vocabulary. So that's all that is.

Then we also need to include our special tokens. We're using the RoBERTa special tokens here, so we write special_tokens, and in here we have our start-of-sequence token (I'm going to put this on a new line; not like that, like this). So we have the start-of-sequence token <s>, the padding token <pad>, the end-of-sequence token </s>, the unknown token <unk>, which, with it being a byte-level encoding, you'd hope it doesn't need to use very much, but it's there anyway, and the mask token <mask>.

So that's everything we need to train our model. One thing I do remember is that if you train on all of those files it takes a very, very long time, which is fine if you're training it overnight or something, but that's not what we're doing here. So I'm just going to shorten that to the first 100 files, and maybe I'll train it after this with the full set. Let's see. I will leave that to train for a while, and I'll be back when it's done.

Okay, so it's finished training our tokenizer, and we can go ahead and actually save it. I'm going to import os, just so I can make a new directory to store the tokenizer files in. A typical Italian name, I've been told, is Filiberto, which fits really well, so this is our Italian BERT model name: filiberto. That is our new directory, and if we come over to here, we have this working directory, which is what I'm in, and then we have this new directory, filiberto, in here. That's where we're going to save our tokenizer. So we just write tokenizer.save_model. You can see here we can do save or save_model. save just saves a JSON file with our tokenizer data inside it, but I don't think that's the standard way of doing it; I think save_model is the way that you want to be doing it. And we're saving it to filiberto, like that. We do that, and we see that we get these two new files, vocab.json and merges.txt.

If we look over here we see both of those, and these are essentially the two steps of tokenization for our tokenizer. When we feed text into our tokenizer, it first goes to merges.txt, and in here we have characters, words, and so on, and they are translated into these tokens. So these are characters on the right, tokens on the left. If we scroll down we can see different ones, and we can keep going. Here we have "zione". Although my Italian is very bad, that is like the English "tion": we would say something like "attention", right? Italians have the same, but they have "attenzione". So that's what we have there: it's part of a word, it's pretty common, and it gets translated into this token here.

After that our tokenizer moves on to vocab.json. I don't know why it's opened at the bottom there; go to the top. If I clean this up quickly, we can see we have a JSON object (it's like a dictionary in Python), and we have all of our tokens and the token IDs that they get translated into. If we scroll down here we should be able to find... was it "va", I think? Okay, so that is our "zione" going into this token here, and then that eventually gets converted into this token ID. So that's our full tokenizer process.

Let's open that file back up. If we wanted to load that, we would do it like we normally would with transformers. So we write from transformers import RobertaTokenizer; we're using a RoBERTa tokenizer here. We can use either RobertaTokenizer or the fast version, it's up to you. We initialize our tokenizer with from_pretrained, and in here, rather than putting a model name from the Hugging Face website, we put the local path to our model directory, so it's filiberto for us. Then we can use that to begin encoding text. So: "ciao, come va", which is like "hi, how are you". If we write that, we can see that we get the tokens here. I wonder what we'd get for "attenzione"; I'll try that in a minute. We have the start-of-sequence token here and the end-of-sequence token here, so the <s> and the </s>, like that. We have those at the start and end of each sequence.

We can also add padding in there, so padding equals 'max_length', and max_length needs to have a value as well, so a maximum of 512. Then we get these padding tokens, which are the 1s. So that's pretty cool. Purely out of curiosity, let's try something else. We have "attenzione"; let's see if we recognize the number there. No, we don't, so I suppose this is probably the full word. In fact it is, so this is the full token here. If we just do this, maybe we will get... I can't remember what number it was, 3322 maybe? Maybe that's right, I'm not sure. But anyway, that's how everything works.

So that's it for this video. In the next video we will take a look at how we can use this tokenizer to build out our input pipeline for training our actual transformer model. Thanks for watching, and I'll see you in the next one.
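
To make the two-file flow from the transcript a little more concrete (merge rules first, then a token-to-ID lookup, then start/end tokens and padding), here is a toy sketch in plain Python. It is not the Hugging Face implementation: it works on characters rather than bytes, and the merge rules, vocabulary, and IDs are invented for illustration and are not taken from the trained filiberto files. Only the special-token order (with padding at ID 1) mirrors what the video shows.

```python
# Toy, character-level illustration of the merges -> vocab lookup flow.
# MERGES and VOCAB are hypothetical; real ones are learned from the corpus
# and stored in merges.txt and vocab.json.
MERGES = [("z", "i"), ("o", "n"), ("zi", "on"), ("zion", "e")]
VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 4,
         "a": 5, "e": 6, "n": 7, "t": 8, "zione": 9}


def apply_merges(word, merges):
    """Repeatedly apply the highest-priority merge rule found in the word."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def encode(word, max_length):
    """Merge, look up IDs, wrap in <s> ... </s>, then pad to max_length."""
    tokens = apply_merges(word, MERGES)
    ids = [VOCAB["<s>"]]
    ids += [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]
    ids += [VOCAB["</s>"]]
    ids += [VOCAB["<pad>"]] * (max_length - len(ids))
    return tokens, ids


tokens, ids = encode("attenzione", max_length=10)
print(tokens)  # ['a', 't', 't', 'e', 'n', 'zione']
print(ids)     # [0, 5, 8, 8, 6, 7, 9, 2, 1, 1]
```

With this tiny rule set only "zione" is merged into a single token; a tokenizer trained on a large corpus would learn many more merges, which is presumably why the video ends up with "attenzione" as one full token.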

Original Description

How can we build our own custom transformer models? Maybe we'd like our model to understand a less common language, how many transformer models out there have been trained on Piemontese or the Nahuatl languages? In that case, we need to do something different. We need to build our own model - from scratch. In this video, we'll learn how to use HuggingFace's tokenizers library to build our own custom transformer tokenizer.

Part 1: https://youtu.be/GhGUZrcB-WM
Part 3: https://youtu.be/heTYbpr9mD8
Part 4: https://youtu.be/35Pdoyi6ZoQ

🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
📙 Medium article: https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403
📖 If membership is too expensive - here's a free link: https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403?sk=aea909609f41be43bdb2dbbd75a801f2
👾 Discord https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff

This video teaches how to build a custom transformer tokenizer using Hugging Face's Tokenizers package and train it on a specific dataset, with a focus on creating a custom tokenizer for less common languages. The lesson covers the concepts of tokenizer construction, text processing, and natural language processing, and provides practical steps for building and training a custom tokenizer.

Key Takeaways
  1. Import required libraries (pathlib, tokenizers, transformers)
  2. Create a list of file paths to text files
  3. Create a string from each path object
  4. Use glob to find all text files in a directory
  5. Initialize a tokenizer (Byte Level BPE Tokenizer)
  6. Train the tokenizer on a specific dataset
  7. Save the tokenizer as vocab.json and merges.txt files
  8. Load the saved tokenizer with RobertaTokenizer (or the fast version) using from_pretrained and the local directory path
  9. Encode text and check the start and end sequence tokens added to each sequence
  10. Pad the encoded input with padding tokens up to a maximum length of 512
💡 Building a custom transformer tokenizer allows for the creation of models that can understand less common languages, which is essential for expanding the reach of natural language processing applications.
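
The file-listing, training, and saving steps above can be sketched end to end as follows. This is a minimal sketch rather than the exact code from the video: it assumes the `tokenizers` package is installed, it writes a tiny stand-in corpus to a temporary directory instead of using the OSCAR files, and the values `vocab_size=30_522` (the usual BERT vocabulary size) and `min_frequency=2` are assumptions, since the transcript does not state the numbers.

```python
import tempfile
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer  # pip install tokenizers

# Stand-in corpus (assumption): a couple of tiny newline-separated text files.
workdir = Path(tempfile.mkdtemp())
samples = [
    "ciao, come va?\nattenzione alla stazione\n",
    "la nazione e la tradizione\nattenzione, attenzione\n",
]
for i, text in enumerate(samples):
    (workdir / f"text_{i}.txt").write_text(text, encoding="utf-8")

# Glob for all text files and turn each Path object into a string.
paths = sorted(str(p) for p in workdir.glob("*.txt"))

# Initialize and train the byte-level BPE tokenizer.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths[:100],   # the video trains on the first 100 files only
    vocab_size=30_522,   # assumed: typical BERT vocabulary size
    min_frequency=2,     # assumed value, not given in the transcript
    show_progress=False,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Save vocab.json and merges.txt into a new model directory.
model_dir = workdir / "filiberto"
model_dir.mkdir(exist_ok=True)
saved_files = tokenizer.save_model(str(model_dir))
print(saved_files)

# Quick look at how a sentence is split by the raw tokenizer.
encoding = tokenizer.encode("ciao, come va?")
print(encoding.tokens)
```

As far as I know, the raw `ByteLevelBPETokenizer` does not wrap the output in `<s>` and `</s>` by itself; in the video those appear after the saved directory is loaded with `RobertaTokenizer.from_pretrained`, which is also where `padding='max_length'` and `max_length=512` are passed.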
