Let's reproduce GPT-2 (124M)

Andrej Karpathy · Advanced · 🧠 Large Language Models · 1y ago
We reproduce GPT-2 (124M) from scratch. This video covers the whole process: first we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 papers and their hyperparameters, then we hit run, come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on knowledge from earlier videos in the Zero to Hero playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

Links:
- build-nanogpt GitHub repo, with all the changes in this video as individual commits: https://github.com/karpathy/build-nanogpt
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- llm.c repo: https://github.com/karpathy/llm.c
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp

Supplementary links:
- Attention Is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
- OpenAI GPT-2 paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com
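The description mentions the sampling loop covered in the video (prefix tokens → forward pass → sample the next token → append and repeat). A minimal sketch of that loop's mechanics in PyTorch is below; note the random logits stand in for a real GPT-2 forward pass, and `sample_next_token` is a hypothetical helper name, not the video's exact code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)

vocab_size = 50257                         # GPT-2's vocabulary size
x = torch.randint(0, vocab_size, (1, 8))   # (B, T) prefix tokens

def sample_next_token(logits, k=50):
    # Top-k sampling: keep only the k most likely tokens, renormalize, sample.
    probs = F.softmax(logits, dim=-1)                    # (B, vocab_size)
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)  # each (B, k)
    ix = torch.multinomial(topk_probs, 1)                # sample within top-k
    return torch.gather(topk_idx, -1, ix)                # map back to vocab ids

# One step of the loop: in the real thing, logits come from model(x)
logits = torch.randn(1, vocab_size)        # stand-in for last-position logits
next_tok = sample_next_token(logits)
x = torch.cat([x, next_tok], dim=1)        # append and repeat
print(x.shape)                             # torch.Size([1, 9])
```

Restricting sampling to the top-k tokens clamps the long tail of the distribution, which keeps generations from derailing on very unlikely tokens.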


Chapters (13)

0:00 intro: Let’s reproduce GPT-2 (124M)
3:39 exploring the GPT-2 (124M) OpenAI checkpoint
13:47 SECTION 1: implementing the GPT-2 nn.Module
28:08 loading the huggingface/GPT-2 parameters
31:00 implementing the forward pass to get logits
33:31 sampling init, prefix tokens, tokenization
37:02 sampling loop
41:47 sample, auto-detect the device
45:50 let’s train: data batches (B,T) → logits (B,T,C)
52:53 cross entropy loss
56:42 optimization loop: overfit a single batch
1:02:00 data loader lite
1:06:14 paramet
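The training chapters above go from data batches (B,T) to logits (B,T,C) to a cross entropy loss. A rough sketch of that step, assuming PyTorch and random data in place of a real dataloader and model (`F.cross_entropy` wants 2D logits, so batch and time are flattened):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)

B, T, C = 4, 32, 50257                     # batch, time, vocab size (GPT-2)
logits = torch.randn(B, T, C) * 0.02       # stand-in for model output at init
targets = torch.randint(0, C, (B, T))      # stand-in for next-token targets

# F.cross_entropy expects (N, C) logits vs (N,) targets, so flatten B and T:
loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
print(loss.item())  # ≈ 10.82, i.e. -ln(1/50257)
```

The printed value is a useful sanity check: with near-uniform logits at initialization, every one of the 50257 tokens gets roughly equal probability, so the loss should start near -ln(1/50257) ≈ 10.82.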