Build a Custom Transformer Tokenizer - Transformers From Scratch #2

James Briggs · Intermediate ·🧠 Large Language Models ·5y ago


Key Takeaways

The video demonstrates building a custom transformer tokenizer with Hugging Face's tokenizers package and training it on the Italian subset of the OSCAR dataset, with a focus on creating a custom tokenizer for less common languages. It uses pathlib (with its glob method) to collect the training files, ByteLevelBPETokenizer from tokenizers to train and save the tokenizer, and RobertaTokenizer from transformers to load the result and encode text.

Full Transcript

Hi, welcome to the video. We're going to have a look at how we can build our own tokenizer in Transformers from scratch. This is the second video in our Transformers From Scratch series, and what we're going to be covering is the actual tokenizer itself. We've already got our data, so we can cross that off and move on to the tokenizer. Let's move over to our code.

In the previous video we created all these files here. These are just a lot of text files that contain the Italian subset of the OSCAR dataset. Let's open one, ignore that, and we get all this Italian. Each sample in this text file is separated by a newline character. So let's go ahead and begin using that data to build our tokenizer.

We first want to get a list of all the paths to our files. We are going to be using pathlib; you could also use os.listdir as well. So, from pathlib import Path. I'm using this one because I've noticed that people are using it a lot at the moment for machine learning stuff. I'm not sure why you would use it over os.listdir, but it's what people are using, so let's give it a go and see how it is. We just want to create a string from each path object that we get, so for x in, and then in here we need to write Path, and in here we basically tell it where to look. We're in the same directory, so we don't really need to do anything here. Then at the end we use glob (I think this is why people are using this), and we just create a wildcard saying we want all text files in this directory. Let's do that, look at the first five, and see that we have our text files. So that's good.

What we can now do is move on to actually training the tokenizer. The tokenizer that we're going to be using is a byte-level byte-pair encoding tokenizer, or BPE tokenizer, and
essentially what that means is that it's going to break down our text into bytes. With most tokenizers that you've probably used (unless you've used this one before) we tend to have unknown tokens. For BERT, for example, we use WordPiece encodings, and we have to have this unknown token for when we don't have a token for a specific word, like some new word. With the BPE tokenizer we are breaking things down into bytes, so essentially we don't actually need an unknown token anymore. That's, I think, pretty cool.

To use that we need to do from tokenizers (this is another Hugging Face package, so you might need to install it with pip install tokenizers) import ByteLevelBPETokenizer, like that. Now we take that and initialize our tokenizer. We just write that, and that's our tokenizer initialized. We haven't trained it yet, so let's train it. We write tokenizer.train, and in here we need to include the files that we're training on. This is why we have that paths variable up here: it's just a list of all the text files that we created, where each sample is separated by a newline character.

Now the vocab size. We're going to be using a RoBERTa model here, and I think the typical RoBERTa vocab size is 50k. You can use that if you want, it's up to you, but I'm going to stick with the typical BERT size, just because I don't think we need that much. We're just figuring things out here, so this is going to mean less training time, and that's a good thing in my opinion. We also set the min frequency, which is saying what is the minimum number of times you want to see a word, or a part of a word, or a byte (it's kind of weird with this tokenizer) before it's added into our vocabulary. That's all that is. Then we also need to include our special tokens, so we're using the
RoBERTa special tokens here. We write special_tokens, and in here we have our start-of-sequence token (I'm going to put this on a new line, not like that, like this). So we have the start-of-sequence token, the padding token, the end-of-sequence token, the unknown token (which, with it being a byte-level encoding, you'd hope it doesn't need to use very much, but it's there anyway), and the mask token. That's everything we need to train our model.

One thing I do remember is that if you train on all of those files it takes a very long time, which is fine if you're training it overnight or something, but that's not what we're doing here. So I'm just going to shorten that to the first 100 files, and maybe I'll train it after this with the full set. I'll leave that to train for a while, and I'll be back when it's done.

OK, it's finished training our tokenizer, and we can go ahead and actually save it. I'm going to import os so I can make a new directory to store the tokenizer files in. A typical Italian name, I've been told, is Filiberto, which fits really well, so this is our Italian BERT model name: FiliBERTo. That is our new directory. If we come over here we have this working directory, which is what I'm in, and then we have this new directory, filiberto, in here. That's where we're going to save our tokenizer. We write tokenizer.save_model. You can see here we can do save or save_model; save just saves a JSON file with our tokenizer data inside it, but I don't think that's the standard way of doing it. I think this is the way that you want to be doing it. We tell it to save to filiberto, like that. We do that and we see that we get these two new files, vocab.json and merges.txt. If we look over here we see both of those, and these are essentially the two steps of tokenization for our tokenizer. So when we feed text into
our tokenizer it first goes to merges.txt, and in here we have characters, words, and so on, and they are translated into these tokens. So these are characters on the right, tokens on the left. If we scroll down we can see different ones, and we can keep going. Here we have zione, which (although my Italian is very bad) is like the English tion. We would say stuff like attention, right? Italians have the same, but they have attenzione. So that's what we have there: it's part of a word, it's pretty common, and that gets translated into this token here.

After that our tokenizer moves on to vocab.json. I don't know why it's at the bottom there; go to the top. If I clean this up quickly, we can see we have a JSON object, which is like a dictionary in Python, and we have all of our tokens and the token IDs that they will get translated into. If we scroll down here we should be able to find, was it va, I think. OK, so va, which is our zione, goes into this token here, and then that eventually gets converted into this token ID. So that's our full tokenization process. Let's open that file back up.

If we wanted to load that, we would do it like we normally would with transformers. We start with from transformers import RobertaTokenizer (we're using a RoBERTa tokenizer here). We can use either RobertaTokenizer or the fast version, it's up to you. We just initialize our tokenizer with from_pretrained, and in here, rather than putting a model name from the Hugging Face website, we put the local path to our model directory, so it's filiberto for us. Then we can use that to begin encoding text. So let's go "ciao, come va", which is like "hi, how are you". If we write that, we can see that we get the tokens here. I wonder if we did attenzione... I'll try in a minute. We have the start-of-sequence token here and the end-of-sequence token here, so the <s> and the </s>, like that. So we have those at the start
and end of each sequence. We can also add padding in there, so padding equals max_length, and max_length needs to have a value as well, so a maximum of 512. Then we get these padding tokens, which are the ones. So that's pretty cool.

And, purely out of curiosity, let's try something else. We have attenzione; let's see if we recognize the number there. No, we don't, so I suppose this is probably the full word. In fact it is, so this is the full token here. If we just do this, maybe we will get... I can't remember what number it was, three three two two maybe. Maybe that's right, I'm not sure. But anyway, that's how everything works.

So that's it for this video. In the next video we will take a look at how we can use this tokenizer to build out our input pipeline for training our actual transformer model. So that's everything, and I'll see you in the next one.
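The two-step lookup described in the transcript (merge rules first, then token IDs) can be pictured with a small library-free sketch. Everything here is an assumption made for illustration: the merge rules, the vocabulary entries and the IDs are invented, and the real byte-level tokenizer works on bytes (with a marker for leading spaces) and picks the best-ranked applicable merge at each step, rather than sweeping through the rules once as this simplified version does.

```python
# Hypothetical merge rules in priority order. merges.txt stores one
# pair per line in the same spirit: "left right" means "join these two".
merges = [("o", "n"), ("i", "on"), ("z", "ion"), ("zion", "e")]

# Hypothetical slice of vocab.json; ids 0-4 are left free for special tokens.
vocab = {"a": 5, "t": 6, "e": 7, "n": 8, "zione": 9}


def apply_merges(word, merges):
    """Split a word into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Join the pair when the current and next symbol match the rule.
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (left, right):
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Step 1: merges turn characters into sub-word tokens.
tokens = apply_merges("attenzione", merges)
# Step 2: the vocabulary maps each token to its integer id.
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['a', 't', 't', 'e', 'n', 'zione']
print(token_ids)  # [5, 6, 6, 7, 8, 9]
```

With more merge rules learned from a large corpus, a frequent word such as attenzione would collapse into a single token, which matches what the transcript observes at the end.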

Original Description

How can we build our own custom transformer models? Maybe we'd like our model to understand a less common language. How many transformer models out there have been trained on Piemontese or the Nahuatl languages? In that case, we need to do something different. We need to build our own model - from scratch. In this video, we'll learn how to use HuggingFace's tokenizers library to build our own custom transformer tokenizer.

Part 1: https://youtu.be/GhGUZrcB-WM
Part 3: https://youtu.be/heTYbpr9mD8
Part 4: https://youtu.be/35Pdoyi6ZoQ

🤖 70% Discount on the NLP With Transformers in Python course: https://bit.ly/3DFvvY5
📙 Medium article: https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403
📖 If membership is too expensive - here's a free link: https://towardsdatascience.com/transformers-from-scratch-creating-a-tokenizer-7d7418adb403?sk=aea909609f41be43bdb2dbbd75a801f2
👾 Discord: https://discord.gg/c5QtDB9RAP
🕹️ Free AI-Powered Code Refactoring with Sourcery: https://sourcery.ai/?utm_source=YouTub&utm_campaign=JBriggs&utm_medium=aff



This video teaches how to build a custom transformer tokenizer using Hugging Face's Tokenizers package and train it on a specific dataset, with a focus on creating a custom tokenizer for less common languages. The lesson covers the concepts of tokenizer construction, text processing, and natural language processing, and provides practical steps for building and training a custom tokenizer.

Key Takeaways

Import the required libraries (pathlib, tokenizers)
Create a list of file paths to the text files
Use Path.glob to find all text files in a directory
Convert each path object to a string
Initialize a tokenizer (ByteLevelBPETokenizer)
Train the tokenizer on the text files, setting the vocab size, min frequency and special tokens
Save the tokenizer with save_model, which writes vocab.json and merges.txt
Load the saved tokenizer with RobertaTokenizer.from_pretrained and use it to encode text
Note the start and end sequence tokens that wrap each encoded sequence
Pad the encoded text with padding="max_length" and a max_length of 512
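The training and saving steps in the list above can be sketched as follows. This is a minimal sketch under stated assumptions, not the video's exact notebook: the two tiny text files stand in for the OSCAR files created in Part 1, the temporary directory and file names are hypothetical, and min_frequency=2 is an assumed value since the transcript does not state it. The vocab size of 30,522 is the usual BERT vocabulary size, which a corpus this small will not come close to filling.

```python
import tempfile
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer  # pip install tokenizers

# Hypothetical stand-in for the OSCAR text files: a temporary folder
# with two small files, one sample per line.
work_dir = Path(tempfile.mkdtemp())
(work_dir / "text_0.txt").write_text(
    "ciao, come va?\nfai attenzione alla lezione\n", encoding="utf-8"
)
(work_dir / "text_1.txt").write_text(
    "la stazione è vicina\nuna buona colazione\n", encoding="utf-8"
)

# Collect every .txt file in the folder as a string path.
paths = sorted(str(x) for x in work_dir.glob("*.txt"))

# Initialize an untrained byte-level BPE tokenizer and train it.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=30_522,  # BERT-sized vocabulary (upper bound)
    min_frequency=2,    # assumed value, not given in the transcript
    show_progress=False,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model needs an existing directory; it writes vocab.json and merges.txt.
model_dir = work_dir / "filiberto"
model_dir.mkdir(exist_ok=True)
saved_files = tokenizer.save_model(str(model_dir))
print(saved_files)
```

Pointing RobertaTokenizer.from_pretrained at the resulting directory is then how the video loads the tokenizer for encoding. On the real dataset, slicing paths (for example to the first 100 files, as in the video) keeps training time manageable.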

💡 Building a custom transformer tokenizer allows for the creation of models that can understand less common languages, which is essential for expanding the reach of natural language processing applications.
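The last two takeaways (start and end tokens, and padding to a fixed length) can be illustrated without any library. This is only a sketch of the resulting layout, not the transformers implementation: it assumes the special tokens received IDs in the order they were passed during training (so `<s>` is 0, `<pad>` is 1, `</s>` is 2, consistent with the transcript's remark that the padding tokens are "the ones"), and the content IDs are made up.

```python
# Assumed special-token ids, following the order used at training time.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2


def add_special_and_pad(token_ids, max_length=512):
    """Wrap ids in <s> ... </s>, then right-pad with <pad> up to max_length."""
    ids = [BOS_ID] + list(token_ids) + [EOS_ID]
    if len(ids) > max_length:
        raise ValueError("sequence longer than max_length; truncate it first")
    n_pad = max_length - len(ids)
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(ids) + [0] * n_pad
    input_ids = ids + [PAD_ID] * n_pad
    return input_ids, attention_mask


# Made-up content ids; a short max_length keeps the output readable.
input_ids, attention_mask = add_special_and_pad([16834, 16, 488, 611, 35], max_length=10)
print(input_ids)       # [0, 16834, 16, 488, 611, 35, 2, 1, 1, 1]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

With max_length=512, as in the video, the same layout simply continues with IDs of 1 out to position 512.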
