Let's build the GPT Tokenizer

Andrej Karpathy · Intermediate · 🧠 Large Language Models · 2y ago
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs): it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets and training algorithms (Byte Pair Encoding), and after training they implement two fundamental functions: encode(), from strings to tokens, and decode(), back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely. (Full chapter list below.)
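As a quick reference for the encode()/decode() interface described above, here is a minimal sketch using OpenAI's tiktoken library, which the lecture introduces around 1:11:38; the sample string is arbitrary, not from the lecture:

```python
import tiktoken  # pip install tiktoken

# Load the BPE tokenizer used by GPT-2; the lecture contrasts it with
# GPT-4's "cl100k_base" encoding.
enc = tiktoken.get_encoding("gpt2")

text = "Hello world, this is tokenization!"  # arbitrary sample string
tokens = enc.encode(text)          # encode(): str -> list of token ids
print(tokens)
print(enc.decode(tokens) == text)  # decode() inverts encode(): prints True
```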
Watch on YouTube ↗


Chapters (21)

0:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
5:50 tokenization by example in a Web UI (tiktokenizer)
14:56 strings in Python, Unicode code points
18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
22:47 daydreaming: deleting tokenization
23:50 Byte Pair Encoding (BPE) algorithm walkthrough
27:02 starting the implementation
28:35 counting consecutive pairs, finding most common pair
30:36 merging the most common pair
34:58 training the tokenizer: adding the while loop, compression ratio
39:20 tokenizer/LLM diagram: it is a completely separate stage
42:47 decoding tokens to strings
48:21 encoding strings to tokens
57:36 regex patterns to force splits across categories
1:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
1:14:59 GPT-2 encoder.py released by OpenAI walkthrough
1:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
1:25:28 minbpe exercise time! write your own GPT-4 tokenizer
1:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
1:43:27 how to set vocabulary set? revisiting gpt.py transformer
1:48:11 training new tokens, example of prompt compression
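The heart of the lecture (chapters 28:35 through 48:21) is a tiny BPE tokenizer: count consecutive pairs, merge the most common one, repeat, then derive encode() and decode() from the learned merges. Below is a minimal sketch of that core in the spirit of the minbpe exercise; the toy training text, merge count, and variable names are illustrative, not the lecture's exact code:

```python
def get_stats(ids):
    """Count how often each consecutive pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with the new token `idx`."""
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

# --- training: start from raw UTF-8 bytes, repeatedly merge the top pair ---
text = "aaabdaaabac"                # toy corpus; real runs use far more text
ids = list(text.encode("utf-8"))    # byte values 0..255 are the base vocab
num_merges = 20                     # i.e. vocab_size = 256 + num_merges
merges = {}                         # (id, id) pair -> new token id
for i in range(num_merges):
    stats = get_stats(ids)
    if not stats:                   # nothing left to merge
        break
    pair = max(stats, key=stats.get)  # most frequent pair wins
    idx = 256 + i
    ids = merge(ids, pair, idx)
    merges[pair] = idx

# --- decoding: expand each token id to its bytes, then UTF-8-decode ---
vocab = {i: bytes([i]) for i in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# --- encoding: greedily apply the earliest-learned merge until none fits ---
def encode(s):
    ids = list(s.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = merge(ids, pair, merges[pair])
    return ids

print(encode("aaab"), decode(encode("aaab")) == "aaab")  # round-trip: True
```

The real GPT tokenizers additionally pre-split text with a regex pattern (chapter 57:36) and handle special tokens (1:18:26) before running these byte-level merges; that machinery is omitted from this sketch.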