Let's reproduce GPT-2 (124M)
We reproduce GPT-2 (124M) from scratch. This video covers the whole process: first we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 papers and their hyperparameters, then we hit run, come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on knowledge from earlier videos in the Zero to Hero Playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.
Links:
- build-nanogpt GitHub repo, with all the changes in this video as individual commits: https://github.com/karpathy/build-nanogpt
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- llm.c repo: https://github.com/karpathy/llm.c
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp
Supplementary links:
- Attention is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
- OpenAI GPT-2 paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com
Chapters:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 paramet
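As a rough illustration of the "data batches (B,T)" and "cross entropy loss" chapters, here is a minimal sketch in plain Python. It is not the code from the video (which uses PyTorch); the helper names `make_batch` and `cross_entropy` and the toy token list are assumptions made up for this example. It shows how a flat token stream can be cut into inputs and next-token targets of shape (B, T), and that a model spreading probability uniformly over GPT-2's 50257-token vocabulary would score a loss of ln(50257), about 10.82.

```python
import math

def make_batch(tokens, B, T):
    """Split a flat token list into inputs x and next-token targets y, each (B, T).

    Hypothetical helper for illustration; needs B*T + 1 tokens so every
    input position has a target one step ahead.
    """
    assert len(tokens) >= B * T + 1, "need B*T + 1 tokens for one batch"
    buf = tokens[: B * T + 1]
    x = [buf[i * T : (i + 1) * T] for i in range(B)]
    y = [buf[i * T + 1 : (i + 1) * T + 1] for i in range(B)]
    return x, y

def cross_entropy(logits, target):
    """Cross entropy at one position: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

# Toy token ids (made-up data), B=2 sequences of T=4 tokens each.
tokens = list(range(9))
x, y = make_batch(tokens, B=2, T=4)

# With all-equal logits every token gets probability 1/vocab_size,
# so the loss equals ln(vocab_size).
vocab_size = 50257  # GPT-2 vocabulary size
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)
print(x, y, round(uniform_loss, 2))
```

A freshly initialized model should land close to this uniform-guess loss, which makes it a handy sanity check before training starts.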