Let's build the GPT Tokenizer

Andrej Karpathy · Intermediate · 🧠 Large Language Models · 2y ago
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs): it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets and training algorithms (Byte Pair Encoding), and after training they implement two fundamental functions: encode(), from strings to tokens, and decode(), back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trac…
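The BPE training loop summarized above (count consecutive pairs, merge the most frequent pair, repeat) can be sketched in a few lines. This is a minimal illustration, not the lecture's exact code: it assumes byte-level initial tokens (ids 0..255) and a target vocabulary size, and the helper names (`get_pair_counts`, `merge`, `train_bpe`) are my own.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count consecutive token pairs in a sequence."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn merges on UTF-8 bytes until the vocabulary reaches vocab_size."""
    ids = list(text.encode("utf-8"))  # raw bytes are tokens 0..255
    merges = {}                       # (pair) -> new token id
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]  # most frequent consecutive pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids
```

Each merge shortens the sequence, which is where the compression ratio discussed at 34:58 comes from: more merges mean fewer, longer tokens.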
Watch on YouTube ↗

Chapters (21)

00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
5:50 tokenization by example in a Web UI (tiktokenizer)
14:56 strings in Python, Unicode code points
18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
22:47 daydreaming: deleting tokenization
23:50 Byte Pair Encoding (BPE) algorithm walkthrough
27:02 starting the implementation
28:35 counting consecutive pairs, finding most common pair
30:36 merging the most common pair
34:58 training the tokenizer: adding the while loop, compression ratio
39:20 tokenizer/LLM diagram: it is a completely separate stage
42:47 decoding tokens to strings
48:21 encoding strings to tokens
57:36 regex patterns to force splits across categories
1:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
1:14:59 GPT-2 encoder.py released by OpenAI walkthrough
1:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
1:25:28 minbpe exercise time! write your own GPT-4 tokenizer
1:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
1:43:27 how to set vocabulary set? revisiting gpt.py transformer
1:48:11 training new tokens, example of prompt compression
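The regex chapter (57:36) covers forcing splits across character categories so that BPE merges never cross, say, a letter/number boundary. The actual GPT-2 pattern requires the third-party `regex` module for Unicode classes like `\p{L}`; the sketch below is a simplified, ASCII-only stand-in using only the standard library, with the same alternation structure (contractions, letters, digits, punctuation, whitespace).

```python
import re

# Simplified ASCII approximation of the GPT-2 split pattern; the real one
# uses the third-party `regex` module with Unicode classes such as \p{L}.
SPLIT_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"      # common English contractions
    r"| ?[A-Za-z]+"                 # optional leading space + letters
    r"| ?[0-9]+"                    # optional leading space + digits
    r"| ?[^\sA-Za-z0-9]+"           # optional leading space + punctuation
    r"|\s+(?!\S)|\s+"               # remaining whitespace
)

def pre_split(text):
    """Split text into chunks; BPE merges then happen within each chunk only."""
    return SPLIT_PATTERN.findall(text)
```

With this pre-split, "world123" tokenizes as a letters chunk plus a digits chunk, so no learned merge can glue letters to digits.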