Python & GPT-3 for Absolute Beginners #3 - What the heck are embeddings?

David Shapiro · Beginner ·🧠 Large Language Models ·3y ago

Key Takeaways

This video covers the basics of embeddings, semantic meaning, and vector representation using Python and GPT-3, with practical examples and code snippets to demonstrate how embeddings can be used for text comparison, search, and classification.

Full Transcript

Morning everybody, David Shapiro here for my third video in the zero to Python and GPT-3 boot camp: what the heck are embeddings? I get this question all the time. It is by far the biggest, hottest topic, so this is why I'm doing it as episode three. Before we get started, I'm going to ask that you consider liking and subscribing, and also jump over to my Patreon to support me there. If I get enough support, who knows, maybe I can do this full time one day. Anyways, let's go ahead and jump into the video.

What are embeddings? First I need to give you a little bit of background. Basically, an embedding is a vector. So what is a vector? Let's just start from scratch. A vector is any string of numbers in an array. Let me make this a little bit bigger so that you can see it. We'll write hashtag vector, and let me set the language to Python so that it looks right. Ta-da, this is a vector. The mathematical definition of a vector is a one-dimensional matrix, so I'll write: a.k.a. one-dimensional matrix, or a list. That's what a vector is.

So what, then, is an embedding? Mathematically a vector and an embedding are the same, but an embedding has semantic meaning. If you want to take a deeper dive, I'll have this link in the description. This is tensorflow.org, which is made by Google. They really advanced this technology a few years ago with Universal Sentence Encoder. This is like the progenitor technology that allowed GPT-3 to exist, and they did this starting in, I think, about 2014.
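
To make the definition of a vector given above concrete, here is a minimal sketch. The variable names and the three example values are illustrative assumptions, not taken from the video's code.

```python
import numpy as np

# A vector is just an ordered list of numbers (values here are made up).
vector = [0.1, -0.4, 0.7]

# The same values as a NumPy array, the form most math functions expect.
array = np.array(vector)

print(array.ndim)   # 1, i.e. one-dimensional
print(array.shape)  # (3,), i.e. three values
```
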
Anyways, if you want to deep dive, follow this link, but for the sake of this video I'll show you a short version. So: embedding equals vector with semantic meaning. Older NLP stuff like NLTK would do word webs, where semantic meaning was always relative to other words. So cat is a type of mammal, for instance; it's also a type of animal, and it's associated with pet. But that was not very efficient. Those technologies are really old, and they thought it was going to revolutionize NLP. It certainly did a lot, but it was not as flexible as neural networks are today.

So an embedding is just a vector with semantic meaning. What does that mean? Let's say each position, each index, in our vector has a meaning. We'll do x and y, so we're going to populate this vector with just two values, and we're basically going to make our own embedding. We'll say that position x equals social power, with a max of 1.0 and a min of negative 1.0. Then, and this is based on one of the original examples that Google used to use, position y equals gender or sex, again with a max of 1.0 and a min of negative 1.0. We'll say one equals ultra-masculine and negative one equals ultra-feminine.

So we have this one-by-two matrix, and we're going to use it to represent a person, or a semantic meaning. If we have a semantic meaning that is one, one, where social power is at its maximum of one and 1.0 is ultra-masculine, what would this be? Who has maximum theoretical social power? That sounds like an emperor, so we'll say this is the Padishah Emperor of the Known Universe. Dune reference there. So, maximum social power possible and also maximum masculinity. I actually don't know if the emperor was maximum masculinity; he might actually be closer to 0.5, so we'll just say that, because, like,
you think ultra-masculine, you think: what if the emperor was like The Rock? So let's duplicate that and say one, one is the Padishah Emperor if he was Dwayne Johnson. There we have our first two embeddings.

Now, what if we go the opposite way? Negative 1.0 would be like a peasant, right? Or actually, probably someone who doesn't have free will, like a prisoner or something. And then we say zero, so this is someone with no free will or agency who is also gender neutral. That's what a semantic meaning is.

Now, with GPT-3, if you go to their embeddings, the smallest one, Ada, has 1024 dimensions. What happens is these models are trained to break down semantic meaning into many, many different dimensions, and Davinci has 12,000 dimensions. Here we're just doing it with one, or sorry, two vectors with two dimensions each, so this is super, super simple.

So now you know the basics of what I mean when I say vector or embedding. Again, the only difference between a vector and an embedding is that an embedding has semantic meaning, and each of those positions is somewhat abstract.

Okay, so then what do you do with it? How do you compare one of these to another? I'm glad you asked. We're going to do a basic similarity search, a basic classification problem. First, let me introduce: I've added import numpy as np. This is a standard math package, I'll say module, for Python. When you import something as something else, this allows you to refer to it by a shorthand, so if I double click on that, you see that numpy is used down here. The way to use these vectors is to compare them with a dot product, and the higher the dot product, the more similar the vectors are. That's
it. It's that simple. So I've added this function. It's super simple: all it does is return the dot product between these two vectors. Then I've added the GPT-3 embedding function, where you just pass it a string, and I've got the engine already set to text similarity Ada. This will suffice for many things, especially if you're just doing a single sentence. If you look at the original Universal Sentence Encoder, they did, I think, 124 and 256 and 512 dimensions. The newer ones are probably bigger than that. All these links will be in the description if you need them.

All right: if name equals main. What do we want to do with this? First, let's just do a really simple dot product. Let's say: what's the difference between the emperor of the known universe if he was Dwayne Johnson versus if it was just the normal super masculine one? We'll say v1, for vector 1, equals 1.0, 1.0, and v2 equals 1.0 and then 0.5. Then let's do print similarity. So v1 will be Emperor Dwayne, and v2 will just be the Padishah Emperor, and then we'll return similarity of v1, v2. Let's run this real quick: cd python gpt3 tutorial, then python embedding. All right, the dot product is 1.5, so that is the level of similarity.

Now what happens if we do a third one? We'll do the androgynous one: negative 1.0 for social power and then 0.0 for gender, so this is the gender-neutral prisoner. Then we'll duplicate this and swap out v2 for v3, so we'll get two outputs. I'll just write, with a hashtag or pound sign, 1.5, so that's the score that returned. Oops, I need to save it. There we go. Okay, so then you see: oh wait, hold on, now this is way different. This one is negative 1.0. Okay, so now you're
starting to get the idea of what it means to compare these. Now, when you have a thousand or twelve thousand dimensions, these numbers get much more nuanced.

Okay, so how do we use this to do search or comparison or clustering? I had this idea. We're going to do categories equals, and we'll do plant, reptile, mammal, fish, so we're going to do four categories. Then what do we do with that? We'll do a while loop, while True, and we'll do a equals input, "enter a life form here". Oops. For Python, the default is that you use single quotes, and I can't remember off the top of my head when you're supposed to switch between single quotes and double quotes. But in PowerShell, for instance, the default is you're supposed to use double quotes.

Okay, so we get a life form, and then what do we do with that? We'll do vector equals GPT-3 embedding of our input, so we'll get an embedding back. Just so you can see what this looks like, let's do a quick printout of this. So we saved it: cls, python embedding. Enter a life form here: we'll do bald eagle, because this is America. And it spits out a huge number. You see how big this number is? Well, it's actually a list of numbers, but you get the idea. So this is the Ada version semantic embedding of bald eagle. It is 1024 floating point numbers between negative 0.1, or rather negative 1.0, and 1.0. We can do this with Davinci and it'll be even longer. Why did this get messed up? Oh, there we go. So there you have it: that's what an embedding looks like. It just shoots it back real quick.

All right, but what are we going to do with this? We want to classify what I put in and match it to one of these categories. How do we do that? Well, the first thing we can do, and this will be really
inefficient, because what I'm going to do is get a vector for each of these every time. Really, what we should do in the long run, in a longer video or a future video, is store the embeddings for each of these.

Okay, so let's do a function. We'll write result equals match class, and we'll pass our initial vector and our categories. This is the vector for whatever life form we put in, and then we pass these categories. Then we'll do def match class, with vector and categories. I know in a previous video I said that it's best practice not to reuse the same variable names, but since I'm not modifying these, I'm just reading them, it's not that big a deal. It's still poor practice, so just keep that in mind. I don't always follow the rules.

Okay, so match class. The first thing we need to do is go through each of these categories. We'll do classes equals list, and then for c in categories, we'll just copy this real quick: vector equals GPT-3 embedding, but instead of a we'll do c. This is a for loop, which means we're going to iterate through each of these items and say: okay, what is my vector? And actually, you know what, if I've got to write this, I might as well put this down below, where we declare our categories. So we'll do for c in categories here. So let's run this once. This is another rule of thumb: if you're going to run a piece of code more than once, you put it in a place where, one, it'll only run once if you only need it to, but also, if you need to call it multiple times, then you declare it as a function. Let's clean this up here first; we'll do it a little bit better.

All right, so we get the vector for the category. Then what do we do? We want to save it in this new variable called classes. So we'll say info equals, and
we'll declare this as a dictionary. This is an explicit definition: info equals dict. Actually, that's not the way I prefer to declare these. You can implicitly declare it with the curly brackets, so that says this is a dictionary. We'll say category equals c, and then vector equals vector. Basically what this does is create what's also called a hash table, where every instance of info is going to have an item named category, so if we need to get the category we just call that, and we can also call the vector. Then we'll say classes dot append info.

Let me show you what this looks like. We'll do print classes, and this is not going to be pretty, because it's going to be pretty big. Then we'll add a quick exit 0 so that we won't even dive into this, but this will show you what we're doing. Let me do a quick time check: we're already at 17 minutes. Okay, so let's do python embedding. Whoops, what did I do? All right. What this is going to do is create a list of dictionaries, and each of those dictionaries is going to contain a couple of pieces of information. Here's what it looks like: category mammal, and then here's the vector that declares that it's a mammal. Okay, cool.

We're almost done, I promise. Let's go ahead and get rid of that; just comment those out. So: result equals match class, vector, and then we'll just pass classes. This will be a really quick search, since we've already done all the embeddings that we need. So let's just match it. In match class, we'll say results equals list, and then for c in classes we want to get the dot product. Brain, I need more coffee, that's what I need. So for c in classes we will do score equals similarity, because that's the
function that we declared right here. We'll do similarity of our vector and then our class vector, c vector, because remember, we have a dictionary that has the vector for each category. So we get the score, and then we'll do info equals: category equals c category, oops, and score equals score. Then results dot append info. And there you have it, that is pretty much it. Then we'll do return results, and we'll print the result at the end, and that should be it.

Let me do a quick test: python embedding. It's going to get the thing in the background. Let's say bald eagle. Ooh, that doesn't quite look right. Let me show you another trick: from pprint import pprint. pprint is called pretty print, so we're going to change this to pretty print, which, instead of it all being on one line, will make it a little bit prettier. Oops, let's do this again. Enter a life form here: bald eagle. Okay, so then we can see category plant, 0.77, not so good; category reptile, 0.82; category mammal, 0.80; category fish, 0.80. Now, you might notice that I did something wrong here: I didn't include fish, or rather bird, as a category. Interestingly, though, the bald eagle, which is a descendant of the dinosaurs, the common ancestor, is closest to reptile. So let's run this again, because I'm silly, and we'll add bird as a category. Let's try that again: bald eagle, and then we see bird, there we go, 0.87. So bald eagle is semantically closest to bird. Ta-da.

All right, what else do we have? Let's do a Komodo dragon. As we'd expect, reptile is the highest score. What else? Let's do a shark. A shark has a 0.91 similarity to fish.

Now, there was a question, I think it was on my Discord server, where we were talking about how you do semantic search for memories and stuff. What if we add another category that is not an animal kingdom? Let's say we do pet
versus wild animal. Let's do that. Do a clear screen real quick, python embedding. Okay, let's do a cat. If we put in a cat, it's closer to pet, with a score of 0.93, than it is to wild animal at 0.85. And it's also a mammal, although this says it's very close to also being a fish. That's kind of funny. Oh wait, nope, it says it's closer to bird than it is to mammal. Interesting. I have seen some people complain about these embeddings, and I'm beginning to see what they mean.

Let's do a domestic cat and see if that clears it up. Domestic cat: 0.83 to mammal. Okay, that's a little bit better, so we're a little bit more specific, and then it's just slightly more pet than wild animal. Let's do a golden retriever. That's not even the name of a species, that's a breed of dog. Okay, so golden retriever is a wild animal according to this, and it is also a reptile. This is pretty funny.

I'm wondering if this gets better with a different embedding engine. We're doing Ada, and we're doing text similarity, but there's also text search. That one I don't think will apply, but let's upgrade to Babbage and just see what happens. This is actually kind of funny; I did not expect this to happen. All right, let's change our engine to text similarity Babbage. cls. So it works in some cases but not in others. It'll take a little bit longer. Let's start with bald eagle. Let's see: reptile, mammal, fish, bird. So bald eagle is just slightly more reptile than anything else. That's interesting. It is a wild animal, though: bald eagle is more associated with the term wild animal than it is with pet. Then what was it, golden retriever. Let's see if we get the right... let's see, it is also a reptile. Interesting. And it's still a wild animal. Okay, so going up
to Babbage didn't help. I wonder if I'm using this wrong. Anyways, you get the idea: you can do this for text search and classification. This is actually kind of funny. I will need to do some research and figure out what I've done wrong here. I'm sure someone will let me know. But anyways, thanks for watching.
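
The classification loop described in the transcript can be sketched roughly as follows. This is a reconstruction under stated assumptions, not the video's actual file. The names `similarity` and `match_class` follow what is said on screen. `FAKE_EMBEDDINGS` and `embed` are hypothetical stand-ins for the GPT-3 embedding call, so the sketch runs without an API key. The three-dimensional vectors are made up, and sorting the results is an addition the video does not mention.

```python
import numpy as np

# Hypothetical stand-in for the embedding API: hand-made 3-D vectors.
# A real embedding model returns hundreds or thousands of dimensions.
FAKE_EMBEDDINGS = {
    "bird": [0.9, 0.1, 0.0],
    "fish": [0.1, 0.9, 0.0],
    "reptile": [0.3, 0.2, 0.9],
    "bald eagle": [0.8, 0.2, 0.3],
}

def embed(text):
    # Assumption: look the text up instead of calling a remote model.
    return np.array(FAKE_EMBEDDINGS[text.lower()])

def similarity(v1, v2):
    # Dot product used as the similarity score, as in the video.
    return float(np.dot(v1, v2))

def match_class(vector, classes):
    # Score the input vector against every stored category vector.
    results = []
    for c in classes:
        score = similarity(vector, c["vector"])
        results.append({"category": c["category"], "score": score})
    # Highest score first (an addition; the video prints them unsorted).
    return sorted(results, key=lambda r: r["score"], reverse=True)

categories = ["bird", "fish", "reptile"]
# Embed each category once up front and reuse the vectors for every query.
classes = [{"category": c, "vector": embed(c)} for c in categories]

results = match_class(embed("bald eagle"), classes)
best = results[0]["category"]
print(best)  # bird
```

With real embeddings the scores sit much closer together, as the 0.77 to 0.87 range in the video shows, so the top category is only as trustworthy as the gap between it and the runner-up.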

Original Description

The Kickstarter for my Post-Labor Economics book is live! https://www.kickstarter.com/projects/daveshap/labor-zero


This video teaches the basics of embeddings and semantic meaning, and how to use GPT-3 and Python for text comparison, search, and classification. It covers vector representation and the dot product as a similarity score, with practical examples and code snippets.
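
One caveat about using the dot product as a similarity score: it grows with the length of the vectors as well as with how closely they point in the same direction. Cosine similarity divides that length out. The video does not use cosine similarity; the sketch below is an illustrative aside, and `cosine_similarity` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity(v1, v2):
    # Dot product divided by both vector lengths, so only direction matters.
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

a = [1.0, 1.0]
b = [2.0, 2.0]   # same direction as a, twice the length
c = [-1.0, 0.0]

dot_ab = float(np.dot(a, b))      # 4.0, inflated by the length of b
cos_ab = cosine_similarity(a, b)  # approximately 1.0, same direction
cos_ac = cosine_similarity(a, c)  # negative, pointing roughly opposite ways
```

For vectors that already have length 1, the two measures give the same number.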

Key Takeaways
  1. Create a vector with semantic meaning
  2. Assign specific values to each index in the vector
  3. Use the vector to represent a concept or object with multiple attributes
  4. Import numpy as np
  5. Define a function to calculate the dot product between two vectors
  6. Pass a string to the GPT-3 embedding function and get a vector back
  7. Calculate the dot product between two vectors and determine their similarity
💡 Embeddings can be used to represent concepts or objects with multiple attributes, and can be compared using the dot product to determine their similarity.
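
The takeaways above can be sketched with the toy two-value embeddings from the video. The numbers are the ones read out in the transcript; the variable names are illustrative assumptions, and the function is a guess at the one-line helper the video describes.

```python
import numpy as np

def similarity(v1, v2):
    # In the video, a higher dot product is read as "more similar".
    return float(np.dot(v1, v2))

# Each vector is [social power, gender], both on a scale from -1.0 to 1.0.
emperor_dwayne = [1.0, 1.0]
padishah_emperor = [1.0, 0.5]
neutral_prisoner = [-1.0, 0.0]

score_emperors = similarity(emperor_dwayne, padishah_emperor)
score_prisoner = similarity(emperor_dwayne, neutral_prisoner)

print(score_emperors)  # 1.5
print(score_prisoner)  # -1.0
```

The two emperors score 1.5 against each other, while the emperor and the prisoner score negative 1.0, matching the outputs reported in the video.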
