Decoding the Decoder LLM without de code: Ishan Anand

AI Engineer · Intermediate · 🧠 Large Language Models · 1y ago

Skills: LLM Foundations 90% · LLM Engineering 80% · Fine-tuning LLMs 70%

Key Takeaways

The video walks through how a decoder-only LLM (GPT-2 small) works using an Excel spreadsheet: words are split into subword tokens, tokens are mapped to numbers (embeddings), the numbers pass through multi-headed attention and a multi-layer perceptron, and the result is turned back into a next-token prediction. It also explains why understanding model architecture, tokenization, and embeddings matters when working with large language models.
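The flow summarized above (text to tokens, tokens to numbers, numbers back to a predicted token) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not GPT-2 itself: the six-entry vocabulary, the two-number embeddings, the greedy longest-match `tokenize`, and the `predict_next` helper are all hypothetical, and the attention and perceptron layers are skipped entirely.

```python
# Toy illustration only: the vocabulary and embedding values are made up
# and are NOT GPT-2's real tokenizer or weights.
VOCAB = ["Mike", " is", " quick", " he", " moves", " quickly"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

# One row of numbers per token (GPT-2 small uses 768 numbers per token).
EMBEDDINGS = [
    [1.0, 0.0],  # "Mike"
    [0.0, 1.0],  # " is"
    [0.5, 0.5],  # " quick"
    [0.9, 0.1],  # " he"
    [0.2, 0.8],  # " moves"
    [0.3, 1.2],  # " quickly"
]


def tokenize(text):
    """Greedy longest-match split against the toy vocabulary (not real BPE)."""
    tokens, pos = [], 0
    while pos < len(text):
        match = max(
            (tok for tok in VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no toy token matches at position {pos}")
        tokens.append(match)
        pos += len(match)
    return tokens


def predict_next(text):
    """Tokens -> IDs -> embeddings -> scores -> highest-scoring token."""
    ids = [TOKEN_TO_ID[tok] for tok in tokenize(text)]
    vectors = [EMBEDDINGS[i] for i in ids]
    # Stand-in for the number crunching: the real model would transform
    # these vectors through attention and perceptron layers first.
    hidden = vectors[-1]
    # Score every vocabulary entry against the last token's vector.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in EMBEDDINGS]
    # Temperature zero: simply take the highest-scoring token.
    best_id = max(range(len(logits)), key=logits.__getitem__)
    return VOCAB[best_id]


print(tokenize("Mike is quick he moves"))
print(predict_next("Mike is quick he moves"))
```

The embedding values were chosen by hand so that the toy prediction lands on " quickly"; in a trained model, that behavior comes from the learned weights rather than from the lookup alone.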

Full Transcript

[Music]

Hope you're all having a good conference, and I hope you're ready, because if you came to this conference or the AI engineering field without a machine learning degree, then this is going to be your crash course in how machine learning models actually work under the hood. Let's bring up the slides. There we go, thank you.

Okay, so I'm Ishan, and I'm dressed in scrubs because today we're all going to be AI brain surgeons. Our patient will be none other than GPT-2, an early precursor to ChatGPT, and our operating table will be a table, but it will be a table of numbers: an Excel spreadsheet. This Excel spreadsheet implements all of GPT-2 small entirely in pure Excel functions. No API calls, no Python. In theory you can understand GPT-2 just by going tab by tab, function by function, through this spreadsheet. But you'll want to hold on to those VLOOKUPs, because there are over 150 tabs and over 124 million cells, one for every single one of the parameters in GPT-2 small. I will give you the abbreviated tour.

So we'll do three things today in our little med school. First, we'll study the anatomy of our patient, how he's put together. Then we're going to put him through a virtual MRI to see how he thinks. And finally, we're going to change his thinking with a little AI brain surgery.

Okay, let's start with anatomy. You're probably familiar with the concept that large language models are trained to complete sentences, to fill in the blank of phrases like this one: "Mike is quick. He moves ___." As a human you might reasonably guess "quickly". But how do we get a computer to do that? Well, here's a fill-in-the-blank that computers are very good at: 2 + 2 = 4. They're really good at math. In fact, you can make it very complex and they do it very well. So what we're going to do, in essence, is take a word problem and turn it into a math problem. In order to do that, we take our whole sentence or phrase and break it into subword units called tokens, and then we map each of
those tokens onto numbers called embeddings. I've shown it for simplicity here as a single number, but the embedding for each token is many, many numbers, as we'll see in a bit. Then, instead of the simple arithmetic shown here, we're doing the much more complex math of multi-headed attention and the multi-layer perceptron. "Multi-layer perceptron" is just another name for a neural network. And finally, instead of getting one precise, exact answer like you used to get in elementary school, we're going to interpret the result as a probability distribution over what the next token should be.

So here's our setup. We get input text, we turn that text into tokens, we turn those tokens into numbers, we do some number crunching, and then we reverse the process: we turn the numbers back into tokens, or text, and we get our next-token prediction. This handy chart shows where each of those actions maps to one or more tabs inside our friendly patient, the spreadsheet. Let's take a look.

The first thing is we get our prompt, right here. The prompt is "Mike is quick. He moves", and after about 30 seconds (we're running in a spreadsheet, so don't use this in production) it will output the next predicted token, "quickly".

The first step is to split this into tokens. You see that every word here goes into a single token, but that's not always the case. In fact, it's not uncommon for a word to be two or more tokens. Let me give you some examples. Here's another version of the sheet; let me zoom this up so you can see it a little better. I've actually put in some fake words. "Reinjury" is a real word, but "funology" isn't a real word. But you know what it means, right? Because it's the word "fun" with "ology" put together. Those are the morphemes, as linguists like to call them, and the tokenization algorithm is actually able to recognize that in some cases. Whoa, there we go. Right there you see "funology" split into "fun" and "ology", if we zoom that one up. There we go. But it doesn't always work, so notice how
"reinjury" got split up right here: it's "rein" and "jury". That's because the algorithm is a little dumb. It just picks the most common subword units it finds in its iterations, and that doesn't always map to your native intuition. So in practice, machine learning experts feel like it's a necessary evil.

The next step is we have to map each of these tokens to its embedding. Let's go back to the original sheet, and that's in this tab here. We have each of our tokens in a separate row, and right here, starting in column 3, is where our embeddings begin. So this row right here, the second row, is all the embeddings for "Mike". In the case of GPT-2 small, each embedding is 768 numbers. We're starting in column 3, so that means if we go to column 770 we will see the end of this. And there's the end of our embedding for "Mike". Let's go back. Each one of these rows, again, is the embedding for one token.

Okay, then we get to the layers. This is the heart of the number crunching. There are two key components: attention, and then the neural network, or multi-layer perceptron. In the attention phase, basically, the tokens look around at the other tokens next to them to figure out the context in which they sit. So the token "he" might look at the word "Mike" to find the antecedent for its pronoun. Or "moves" might look at the word "quick", because "quick" actually has multiple meanings. "Quick" can mean movement in physical space. It can mean smart, as in quick of wit. It can mean a body part, like the quick of your fingernail. And in Shakespearean English it can mean alive, as in "the quick and the dead". Seeing that the word "moves" is here helps it disambiguate for the next layer, the perceptron: oh, we're talking about moving in physical space, so maybe it's "quickly", or maybe it's "fast", or maybe it's "around", but it's certainly not something about your fingernail.

So let's see where this is all happening. These are our layers. Now, there are 12 of them, so this
is block 0 all the way to block 11. Each one's a tab. And if you go up here (we can't go through all of this in the time we have), this is one of the attention heads. This is step seven, and this is where you can see how much each token is paying attention to every other token. You'll notice that there's a bunch of zeros up at the top right, and that's because no token is allowed to look forward; they can only look backwards in time. You'll see here that "Mike" is looking at "Mike" 100% of the time. Higher values mean more attention, and these are all normalized to sum to one. Here is the word "he", or the token "he" I should say, and you'll notice 0.48, so about half of its attention is focused on the antecedent of its pronoun. Now, this is just one of many heads. If I scroll to the right you'll see a lot more. They aren't always as directly interpretable as that, but it gives you a sense of how the attention mechanism works.

If we scroll further down we'll see the multi-layer perceptron, right here. If you know something about neural nets, you know they're just a large combination of multiplications and additions, or a matrix multiply. I don't know if you can see this in the back, but there's an MMULT, which is how you do a matrix multiply in Excel, and that's basically multiplying the input by its weights. Then here we put it through its activation function to get the next prediction. Okay, let's keep going.

Next we have the language head, and this is where we actually reverse the process. What we do is take the last token and unembed it, reversing the embedding process we did before. We look at which tokens are closest to the final token's unembedding, and we interpret that as a probability distribution. Now, if you're at temperature zero, like we are in this spreadsheet, then you just take the thing with the highest probability. But if your temperature is higher, then you sample from it according to some algorithm like beam search. Let's take a
look, and we'll go here. Again, I don't know if you can see in the back, but this function here (there we go) is basically taking block 11, the output of the very last block, and putting it through a step called layer norm. Then we multiply it, another MMULT, by the unembedding matrix, and these are what are known as our logits. To predict the next most likely token, we just go to the next one. If you can see this function, it's basically taking the max of the column you saw in the previous sheet, and it's taking the highest-probability token, just like that. That's our predicted token. We get a token ID, then we look it up in the matrix, and we know what the next likely token is.

Okay, so that's the forward pass of how GPT-2 works. But how do all those components work together? Let's take our patient and put him through a virtual MRI so we can see how he thinks.

Before we do that, there's something I forgot to mention. These are called residual connections. Inside every layer there's an addition operation. What this lets the model do is route information around, and completely skip, any part of these layers, either attention or the perceptron. So you can reimagine the model as actually a communication network, or a communication stream. The residual stream here is every one of those tokens, and information is flowing through them like an information superhighway. What each layer is doing is this: we've got attention moving information across the lanes of this highway, and then the perceptron trying to figure out what the likely token is for every single lane of the highway. But there are multiple of these layers, so they're really reading and writing information to each other on this communication bus.

What we can do is use a technique called logit lens. We can take the language head we talked about earlier, stick it in between every single layer of the network, and see what it was thinking at
that layer. So that's what I've done in this sheet. I give it the prompt "If today is Tuesday, tomorrow is", and the predicted token is "Wednesday". GPT-2 does this correctly for all seven days. What you see in this chart is essentially this: the columns here, from 3 through N, are all those lanes of the information superhighway. For example, here at block 3, this is the top predicted token at the last token position. It predicted "not", the second most likely word was going to be "still", then it was going to be "just". These are all wrong.

So let's look for what we know is the right answer, "Wednesday". Over here at block 0 we see "Wednesday". It's at the bottom of the "Tuesday" stream, for some reason, on that highway. Well, it makes sense that it'd be close to "Tuesday". And then it completely disappears. And then, oh, over here towards the last few layers, suddenly we see "tomorrow", "forever", "Tuesday", "Friday". It knows we're talking about time, we're talking about days, and it gets "Wednesday", but it's still the third most likely token. Then finally it moves it up to the final position and locks it into place.

So what's going on here? Well, a group of researchers basically took this logit lens technique on steroids and isolated that only four components out of the entire network were responsible for doing this correctly over all seven days. What they found was that all you needed was the perceptron from layer 0, attention from layer 9 (and actually only one head), the perceptron from layer 9, and then attention from layer 10. That's kind of what we saw in the sheet, right? At the top we saw "Wednesday", and then it disappeared until the later layers pulled it back up and up in probability towards the end of the process. So it's an example of where you can see each layer acting as a communication bus, trying to jointly figure out and create what they call a circuit to accomplish a task.

Okay, we are now out of med school and ready for surgery. You may have heard about the pioneering work
that Anthropic has done on scaling monosemanticity. This gave rise to what was known as Golden Gate Claude. It was a version of Claude that was very obsessed with the Golden Gate Bridge; to some, it felt like it thought it was the Golden Gate Bridge. Conceptually, here's how this process worked. You have a large language model, and you have this residual stream we talked about earlier. Then you use another AI technique, an autoencoder (this one's a sparse autoencoder), and you ask it to look at the residual stream and separate it out into interpretable features. You then try to deduce what each feature is, and then you can actually turn each of these features up and down back in the residual stream in order to amplify or suppress certain concepts.

It turns out a team of researchers led by Joseph Bloom, Neel Nanda, and others are building out sparse autoencoder features for open-source models like GPT-2 small. So here, for example, is layer 2's feature 7650. I don't know if you can see it in the back; it's basically everything Jedi. So, going to our friendly patient again, I've taken the vector for that feature. While we wait for Excel to wake up... there it is. That first row is essentially what they call the decoder vector corresponding to Jedi. I've multiplied it by a coefficient, and then I've formatted it so that I can inject it right into the residual stream. So this is the start of the block. If you can see that "steer block 2", it's basically just taking that vector I showed you and adding it into the residual stream. It's simply addition.

Now we go to our prompt. Normally, if we ask GPT-2 "Mike pulls out his", it makes sense: he pulls out his "phone". But if we turn the Jedi steering vector on, I'll give you one guess what he's probably going to pull out. Let's see. Okay, so now we hit calculate, and this is where you get to witness the 30 seconds it takes. While we wait for it to run, a couple of notes. First of all, the way Anthropic did
their steering was slightly different but similar in spirit. There are a few other ways to do this kind of steering. One of those is called representation engineering, where the steering vector is deduced via PCA, or principal component analysis. There's another technique called activation steering, where you take the thing you want to amplify, like "Jedi", and run the model on just that token, then you run it on something you might want to suppress, like in this case "phone", and then you create a "Jedi minus phone" vector and inject that into the residual stream.

Okay, there it is, there it is: "Mike pulls out his lightsaber". There we go, we have done it. Our operation has been a success. We've created the world's first GPT-2 Jedi. Stick that on LMSYS Arena.

Okay, well, hopefully I've given you a little better insight into how large language models work, but also why they work. The root message I want to leave you with is that to be a better AI engineer, it does help to unlock the black box. Partly this is about just knowing your tools, their behavior, and their limitations better. But also, we're in a very fast-moving field, and if you want to understand the latest research, it helps to know how these work. And last but not least, when you communicate with non-technical stakeholders, there's very often a perception of magic, and the more you can clear that up, the more you can clear up misunderstandings.

I'll give you just one example of where architecture bubbles up to how you use these models. These are the instructions for RWKV, which is a different type of model. The template for a normal Transformer is at the top, and the template for an RWKV prompt is at the bottom. What's interesting is that they recommend you swap the traditional order of instructions and context, because the attention mechanism, or the pseudo-attention mechanism, in RWKV can't look back the same way a regular Transformer can. So it's a great example of where
model architecture matters all the way up to prompting.

Okay, here are the references for the research we talked about today. If you want to learn more, you can go to spreadsheets-are-all-you-need.ai, where you can download this spreadsheet and run it on your own device. If you want to see me go through every single step of this spreadsheet, I just launched a course on Maven today, and the link to it is on that website as well. And that's it. Thank you.

[Music]
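The attention table described in the talk (zeros at the top right, each row summing to one, the first token attending fully to itself) is what a causally masked softmax produces. Below is a minimal sketch in Python, assuming some raw attention scores are already available; the score values and the function name `causal_attention_weights` are hypothetical and are not taken from the spreadsheet.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax in which each token sees only itself and earlier tokens."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]           # token i may not look forward
        peak = max(visible)              # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero attention.
        padding = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + padding)
    return weights


# Hypothetical raw scores for a three-token prompt (rows attend to columns).
scores = [
    [0.2, 1.5, -0.3],
    [1.1, 0.4, 2.0],
    [0.7, 0.7, 0.1],
]

for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

Because the first token can only see itself, its row is always a single 1.0 followed by zeros, which matches "Mike is looking at Mike 100% of the time" in the walkthrough.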

Original Description

Spreadsheets are all you need: Decoding the Decoder LLM without de code

The struggle to grasp the inner workings of AI models can leave even experienced engineers from non-ML backgrounds feeling lost in a sea of terminology and new concepts. What if the key to understanding the intricate mechanics of LLMs didn't require a Ph.D.? This session offers an innovative approach, employing spreadsheets to dissect and demystify the architecture of decoder-based LLMs using a fully working implementation of GPT-2 entirely in Excel. Attendees will tour GPT-2's architecture, from tokenization and embeddings to attention and the multi-layer perceptron, all translated into the accessible format of spreadsheets with minimal abstractions to get in the way. By the end, you'll gain unparalleled insights into AI's backbone, transforming abstract concepts into tangible, understandable processes, without ever touching code.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Ishan

Ishan was most recently VP of Product for Edgio Applications, a platform that leverages edge computing, serverless, and AI/ML to enable enterprise teams to accelerate, host, and secure their high-stakes websites. Ishan joined Edgio via the acquisition of Layer0, where he was the CTO and co-founder. He's also the creator of Spreadsheets-are-all-you-need.ai, which combines AI and spreadsheets into a course that teaches how LLMs work through an implementation of GPT-2 (an ancestor of ChatGPT) entirely in Excel. Ishan has spoken at conferences such as JSMobile, Next.js, JSWorld, and React Day New York on web performance, Jamstack, and Core Web Vitals.



This video shows how a decoder LLM works by stepping through an Excel spreadsheet implementation of GPT-2 small, and why model architecture, tokenization, and embeddings matter when working with large language models. It also covers how sparse autoencoder features and activation steering can be used to steer a model's output toward a chosen concept.

Key Takeaways

Split input text into tokens
Map tokens to numbers (embeddings)
Perform complex math using multi-headed attention and multi-layer perceptron
Reverse process to get next token prediction
Take the decoder vector corresponding to a concept and multiply it by a coefficient
Inject the resulting vector into the residual stream to steer the model towards the concept
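The last two takeaways amount to a scaled vector addition, which can be sketched as follows. This is a minimal illustration, assuming the residual stream is a list of per-token vectors and that the scaled vector is added at every token position; the function name `steer`, the example numbers, and the choice of positions are hypothetical rather than details taken from the spreadsheet.

```python
def steer(residual_stream, decoder_vector, coefficient):
    """Add coefficient * decoder_vector to each token's residual vector."""
    return [
        [x + coefficient * d for x, d in zip(token_vec, decoder_vector)]
        for token_vec in residual_stream
    ]


# Hypothetical two-token residual stream with two numbers per token
# (GPT-2 small would have 768 numbers per token).
residual_stream = [[1.0, 2.0], [0.0, -1.0]]
concept_vector = [0.5, -0.5]   # stand-in for a feature's decoder vector

steered = steer(residual_stream, concept_vector, coefficient=2.0)
print(steered)
```

A larger coefficient pushes the stream further toward the concept, and a coefficient of zero leaves the model's activations unchanged.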

💡 Understanding model architecture, tokenization, and embeddings is crucial for working with large language models, and sparse autoencoder features and activation steering can be used to amplify or suppress specific concepts in a model's output.
