Decoding the Decoder LLM without de code: Ishan Anand
Key Takeaways
The video demonstrates how to decode the Decoder LLM using an Excel spreadsheet, mapping words to subword units and numbers, and performing complex math using multi-headed attention and multi-layer perceptron. It also discusses the importance of understanding model architecture, tokenization, and embedding in large language models.
Full Transcript
[Music] hope you're all having a good conference and I hope you're ready because if you came to this conference or the AI engineering field without a machine learning degree then this is going to be your crash course and how machine learning models actually work under the hood let's let's bring up uh the slides there we go thank you okay so I'm isan and I'm dressed in Scrubs because today we're all going to be AI brain surgeons and our patient will be none other than gpt2 an early precursor to chat GPT and our operating table will be a table but it will be a table of numbers it will be an Excel spreadsheet this Excel spreadsheet implements all of gpt2 small entirely in pure Excel functions no API calls no python in theory you can understand gpt2 just by going tab by tab function by function through this spreadsheet but you want to hold on to those vlookups because there's over 150 tabs and over 124 million cells for every single one of the parameters in gpt2 small I will give you the abbreviated tour so we'll do three things today in our little med school first we'll study the anatomy of our patient how he's put together then we're going to put him through a virtual MRI to see how he thinks and then finally we're going to change his thinking with a little AI brain surgery okay let's start with Anatomy your probably familiar with the concept that large language models are trained to complete sentences to fill in the blank of phrases like this one Mike is quick he moves and as a human you might reasonably guess quickly but how do we get a computer to do that well here's a fill-in-the blank that computers are very good at 2+ 2 equal 4 right they're really good at math in fact you can make it very complex and they do it very well so what we're going to do in essence is we're going to take a word problem and turn it into a math problem in order to do that we take our whole sentence or phrases and we break them into subword units called tokens and then we map each of those tokens onto numbers called embeddings I've shown it for Simplicity here as a single number but the embedding for each token is many many many numbers as we'll see in a bit and then instead of the simple arithmetic shown here we're doing the much more complex math of multi-headed detention and the multi-layer perceptron multi-layer perceptron just another name for a neural network and then finally instead of getting one precise exact answer like you used to get in elementary school we're going to interpret the result as a probability distribution as to what the next token should be so here's our setup we get input text we turn that text into tokens we turn those tokens into numbers we do some number crunching and then we reverse the process we turn the numbers back out into tokens or text and then we get our next token prediction so this handy chart shows where each of those actions map to one or more tabs inside our friendly patient spreadsheet let's take a look so the first thing you do is we get our prompt right here the prompt is mic is quick he moves and then it will output after about 30 seconds since we're running in a spreadsheet don't use this in production the next predicted token of quickly so the first step is to split this into tokens now you see that every word here goes into a single token but that's not always the case in fact it's not uncommon to be two or more tokens let me give you some examples so here's another version of the sheet let me Zoom this up so you can see it a little better right I've put actually some fake words reinjury is a real word but funology isn't a real word uh but you know what it means right because it's the word fun with ology put together those are the morphemes as linguists like to call them and the tokenization algorithm actually is able to recognize that in some cases whoa there we go right there you see fun split into a fun anology if we Zoom that one up there we go but it doesn't always work so notice how reinjury got split up right here it's rain and jury and that's cuz the algorithm is a little dumb it just picks the most common subword units it finds in its iterations and it doesn't always map to your native intuition and so in practice machine learning experts feel like it's a necessary evil um and then the next step is we have to map each of these tokens to the embeddings so let's go back to the original one and that's in this tab here so we have each of our tokens in a separate row and then right here starting in column 3 is where our embeddings begin so this is row right here the second row is all the embeddings for Mike now in the case of gpt2 small the embeddings are 768 numbers so we're starting in column 3 so that means if we go to column 770 we will see the last end of this and so there's the end of our embeddings for Mike and then let's go back and each one of these again is the embedding for uh each token okay then we get to the layers this is the heart of the number crunching so there are two key components there's ATT tension and then the neural network or multi-layer perceptron and in the intention phase basically the tokens look around at the other tokens next to them to figure out the context in which they sit so the token he might look at the word Mike to look at the antecedent for its pronoun or moves might look look at the word quick because quick actually has multiple meanings quick can mean movement in physical space it can mean smart as in quick of wit it can mean a body part like the quick of your fingernail and in Shakespearean English it can mean alive or dead like the quick or the dead and seeing that the word moves here helps it disambiguate for the next layer the perceptron that oh we're talking about moving in physical space so maybe it's quickly or maybe it's fast or maybe it's around but it's certainly not something about your fingernail so let's see where this is all happening so these are our layers now there's 12 of them so this is block zero all the way to block 11 each one's a tab and then if you go up here we can't go through all of this in the time we have but this is one of the attention Heads This is Step seven this is where you can see where each token is paying attention to every other token and you'll notice that there's a bunch of zeros up at the top right and that's because no token is allowed to look forward they can only look backwards in time and you'll see here that Mike is looking at Mike 100% of the time higher values mean more attention these are all normalized to one uh here is the word he or the token he I should say and you'll notice 0.48 so about half of its attention is focused on its the antecedent of its pronoun now this is just one of many heads if I scroll to the right you'll see a lot more uh there aren't always as directly interpretable as that uh but it gives you a sense of how the attention mechanism works and then if we scroll further down we'll see the multi-layer perceptron right here if you know something about neural Nets you know they're just a large combination of multiplications and additions or a m Matrix multiply and so I don't know if you can see this in the back there's a m mult which is how you do an Excel Matrix multiply and that's basically multiplying it times its weight and then here we put it through its activation function to get the next prediction okay let's keep going okay next we have the language head and this is where we actually reverse the process so what we do is we take the last token and we uned it and reverse the embedding process we did before and we probabilistically look at which are the tokens the closest to the final last tokens un embedding and we interpret that as a probability distribution now if you're at temperature zero like we are in this spreadsheet then you just take the thing with the highest probability but if your temperature is higher then you sample it according to some algorithm like beam search let's take a look and we'll go here so again I don't know if you can see in the back but this function here is basically there we go this function in the back basically is taking block 11 the output of the very last block it's putting it through a step called layer Norm then we multiply it another m m times The unembedded Matrix and these are what are known as our logits and then to predict the next most likely token we just go to the next next one and if you can see this function it basically is looking at Max of the previous column you saw in the previous sheet um and it's taking the the highest probability token just like that and that's our our predicted token we get a token ID then we look it up in the Matrix and we know what the the next likely token is okay so that's the forward pass of how gpt2 works but how do all those components work together so let's take our patient and put him through a virtual MRI so we can see how he thinks before we do that there's something I forgot to mention these are called residual connections inside every layer there's an addition operation what this lets the model do is it lets it route information around and completely skip any part of these layers either a tension or the perceptron and so you can reimagine the model is actually a communication Network or a communication Stream So the residual stream here is every one of those tokens and information is flowing through them like an information Super Highway and what each layer is doing is we've got attention moving information across the lanes of this highway and then the perceptron trying to figure out what the likely token is for every single Lane of the highway but there are multiple of these layers so they're really reading and writing to each other information in this communication bus what we can do is we can do a technique called loit lens we can take the language head we talked about earlier and stick it in between every single layer of the network and what was it thinking at that layer so that's what I've done in this sheet so I give it the prompt if today is Tuesday tomorrow is and the predicted token is Wednesday and gpt2 does this correctly for all seven days what you see in this chart is essentially The Columns here from 3 through n are all those Lanes of the information Super Highway and for example here at block three this is the top most predicted token uh the last token position so it predicted not the second most likely word was going to be still then it was going to be just these are all wrong so let's look for we know is the right answer Wednesday so over here at block zero we see Wednesday it's at the bottom of the Tuesday stream for some reason on that Highway well it makes sense it'd be close to Tuesday and then it completely disappears and then oh over here towards the last few layers suddenly we see tomorrow forever Tuesday Friday it knows we're talking about time we're talking about days and it gets Wednesday but it's still the third most likely token and then finally it moves it up to the final position and then it locks it into place so what's going on on here well a series of researchers uh basically took this lit lens technique on steroids and isolated that only four components out of the entire network were responsible for doing this correctly over all seven days what they found was that all you needed was the perceptron from layer zero attention from layer 9 and actually only one head uh the perceptron from layer 9 and then attention from layer 10 and that's kind of what we saw in the sheet right at the top we saw Wednesday and then it disappeared until the later layers pulled it back up and up in probability at towards the end of the process so it's an example of where you can see each layer acting as a communication bus trying to jointly figure out and create what they call a circuit to accomplish a task okay we are now out of med school and ready for surgery so you may have heard about uh the pioneering work that anthropic has done about scaling monos semanticity this gave rise to what was known as Golden Gate CLA it was a version of claw that was very obsessed with the Golden Gate Bridge to some it felt like it thought it was the Golden Gate Bridge uh conceptually here's how this process worked you have a large language model and then you have this residual stream we talked about earlier and then you use another AI technique an autoencoder this one's a sparse Auto encoder and you ask it to look at the residual stream and separate it out into interpretable features and you then try and deduce what each feature is and then you can actually turn up and down each of these features back back in the residual stream in order to amplify or suppress certain Concepts it turns out a team of researchers led by Joseph Bloom Neil Nanda and others are building out sparse Auto encoder features for open source models like gpt2 small so here for example is layer 2's feature 7650 I don't know if you can see it in the back it's basically everything Jedi So Gone to our friendly patient again and I've taken the vector for that feature while we wait for Excel to wake up there it is that first row is essentially what they call the decoder Vector corresponding to Jedi and then I've basically multiply it by a coefficient and then I've basically formatted it so that I can inject it right into the residual stream so this is the start of the block if you can see that steer block to it's basically just taking that Vector I showed you and adding it into the residual stre simply Edition now we go to our prompt and originally normally us ask gpt2 Mike pulls out his makes sense he pulls out his phone but if we turn the Jedi steering Vector on I'll give you one guess what he's probably going to pull out let's see okay so now we hit calculate now um and this is where you get to witness the 30 seconds it takes um and while we wait for it to to run a couple notes so first of all the way anthropic did their steering was slightly different but Sim in spirit there's a few other ways to do this kind of steering one of those is called representation engineering where the steering Vector is deduced via PCA or principal component analysis and there's another technique called activation steering where what you do is you take the thing you want to amplify like Jedi and you would run the model through just on that token and then you'd run on something you might want to suppress like in this case phone and then you'd create a phone a Jedi minus phone vector and inject that into the residual stream okay there it is there it is Mike pulls out his lightsaber there we go we have done it our operation has been a success we've created the world's first gpt2 Jedi stick that on LM C Arena okay uh well hopefully I've given you a little better insight into how large language models work but also why they work but the root message I want to leave with is that to be a better AI engineer it does help to unlock the Black Box partly this about just knowing your tools and their behavior and their limitations better uh but also we're in a very fast-moving field and if you want to understand the latest research it helps to know how these work and then last but not least when you communicate with non-technical stakeholders there's very often a perception of magic and the more you can clear that up the more you can clear up misunderstandings I'll give you just one example of where this bubbles up where architecture bubbles up to how you use them so this is the uh instructions for RW KV which is a different type of model but the template for a normal Transformer is at the top the template for an RW KV uh prompt is at the bottom and what's interesting is that they recommend you swap the traditional order of instructions and context because the attention mechanism or the pseudo attention mechanism in RW KV can't look back the same way a regular Transformer can so it's a great example of where model architecture matters all the way up to prompting okay here are the references for the research we talked about today and then if you want to learn more you can go to spreadsheets or all. and you can download this spreadsheet and you can run it on your own device if you want to see me go through every single step of this spreadsheet I just launched a course on Maven today um and the link to it is on that website as well um and that's it thank you [Music]
Original Description
Spreadsheets are all you need: Decoding the Decoder LLM without de code
The struggle to grasp the inner workings of AI models can leave even experienced engineers from non-ML backgrounds feeling lost in a sea of terminology and new concepts. What if the key to understanding the intricate mechanics of LLMs didn't require a Ph.D.? This session offers an innovative approach, employing spreadsheets to dissect and demystify the architecture of decoder-based LLMs using a fully working implementation of GPT-2 entirely in Excel. Attendees will tour through GPT-2's architecture from tokenization, embeddings, attention, multi-layer perceptron, all translated into the accessible format of spreadsheets with minimal abstractions to get in the way. By the end, you'll gain unparalleled insights into AI's backbone, transforming abstract concepts into tangible, understandable processes, without ever touching code.
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025
About Ishan
Ishan was most recently VP of Product for Edgio Applications, a platform that leverages edge computing, serverless, and AI/ML to enable enterprise teams to accelerate, host, and secure their high-stakes websites. Ishan joined Edgio via the acquisition of Layer0, where he was the CTO and co-founder. He's also the creator of Spreadsheets-are-all-you-need.ai which combines AI and Spreadsheets into a course that teaches how LLMs work through an implementation of GPT2 (an ancestor of ChatGPT) entirely in Excel. Ishan spoken at conferences such as JSMobile, Next.js, JSWorld, and React Day New York on web performance, Jamstack, and Core Web Vitals.
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
Playlist
Uploads from AI Engineer · AI Engineer · 53 of 60
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
▶
54
55
56
57
58
59
60
AI Engineer Summit 2023 — DAY 1 Livestream
AI Engineer
AI Engineer Summit 2023 — DAY 2 Livestream
AI Engineer
Principles for Prompt Engineering - Karina Nguyen (Claude Instant @ Anthropic)
AI Engineer
Announcing the AI Engineer Network: Benjamin Dunphy
AI Engineer
The 1,000x AI Engineer: Swyx
AI Engineer
Building AI For All: Amjad Masad & Michele Catasta
AI Engineer
The Age of the Agent: Flo Crivello
AI Engineer
See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman
AI Engineer
Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase
AI Engineer
Pydantic is all you need: Jason Liu
AI Engineer
Building Blocks for LLM Systems & Products: Eugene Yan
AI Engineer
The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer
AI Engineer
Climbing the Ladder of Abstraction: Amelia Wattenberger
AI Engineer
Supabase Vector: The Postgres Vector database: Paul Copplestone
AI Engineer
[Workshop] AI Engineering 101
AI Engineer
The Hidden Life of Embeddings: Linus Lee
AI Engineer
[Workshop] AI Engineering 201: Inference
AI Engineer
The AI Pivot: With Chris White of Prefect & Bryan Bischof of Hex
AI Engineer
The AI Evolution: Mario Rodriguez, GitHub
AI Engineer
Move Fast Break Nothing: Dedy Kredo
AI Engineer
AI Engineering 201: The Rest of the Owl
AI Engineer
Building Reactive AI Apps: Matt Welsh
AI Engineer
Pragmatic AI with TypeChat: Daniel Rosenwasser
AI Engineer
Domain adaptation and fine-tuning for domain-specific LLMs: Abi Aryan
AI Engineer
Retrieval Augmented Generation in the Wild: Anton Troynikov
AI Engineer
Building Production-Ready RAG Applications: Jerry Liu
AI Engineer
120k players in a week: Lessons from the first viral CLIP app: Joseph Nelson
AI Engineer
The Weekend AI Engineer: Hassan El Mghari
AI Engineer
Harnessing the Power of LLMs Locally: Mithun Hunsur
AI Engineer
Trust, but Verify: Shreya Rajpal
AI Engineer
Open Questions for AI Engineering: Simon Willison
AI Engineer
Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD
AI Engineer
GPT Web App Generator - 10,000 apps created in a month: Matija Sosic
AI Engineer
Using AI to Build an Infinite Game: Jeff Schomay
AI Engineer
How to Become an AI Engineer from a Fullstack Background - Reid Mayo
AI Engineer
The Code AI Maturity Model and What It Means For You: Ado Kukic
AI Engineer
AI Engineer World’s Fair 2024 - Keynotes & Multimodality track
AI Engineer
From Text to Vision to Voice Exploring Multimodality with Open AI: Romain Huet
AI Engineer
The Making of Devin by Cognition AI: Scott Wu
AI Engineer
The Future of Knowledge Assistants: Jerry Liu
AI Engineer
Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney
AI Engineer
Open Challenges for AI Engineering: Simon Willison
AI Engineer
Lessons From A Year Building With LLMs
AI Engineer
From Software Developer to AI Engineer: Antje Barth
AI Engineer
Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner
AI Engineer
Copilots Everywhere: Thomas Dohmke and Eugene Yan
AI Engineer
Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han
AI Engineer
Low Level Technicals of LLMs: Daniel Han
AI Engineer
Emergence Launch: AI Agents and the future enterprise: Dr. Satya Nitta
AI Engineer
How Codeium Breaks Through the Ceiling for Retrieval: Kevin Hou
AI Engineer
What's new from Anthropic and what's next: Alex Albert
AI Engineer
Using agents to build an agent company: Joao Moura
AI Engineer
Decoding the Decoder LLM without de code: Ishan Anand
AI Engineer
Running AI Application in Minutes w/ AI Templates: Gabriela de Queiroz, Pamela Fox, Harald Kirschner
AI Engineer
Building with Anthropic Claude: Prompt Workshop with Zack Witten
AI Engineer
Building Reliable Agentic Systems: Eno Reyes
AI Engineer
10x Development: LLMs For the working Programmer - Manuel Odendahl
AI Engineer
Disrupting the $15 Trillion Construction Industry with Autonomous Agents: Dr. Sarah Buchner
AI Engineer
Hypermode Launch: Kevin Van Gundy
AI Engineer
Git push get an AI API: Ryan Fox-Tyler
AI Engineer
More on: LLM Foundations
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
Building HITL Feedback RAG: Embeddings, Retrieval, and Reranking
Medium · AI
Building HITL Feedback RAG: Embeddings, Retrieval, and Reranking
Medium · LLM
The 2026 AI Model Release Race: Every Major LLM Launch You Need to Know
Dev.to AI
Call GPT, Claude, and Gemini from one API key — a 3-step setup
Dev.to AI
🎓
Tutor Explanation
DeepCamp AI