The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer

AI Engineer · Intermediate · 🚀 Entrepreneurship & Startups · 2y ago

Skills: LLM Foundations 80% · Prompt Craft 70% · Prompting Basics 60% · Advanced Prompting 50%

Key Takeaways

The video discusses the evolution of intelligent interfaces, specifically how AI-driven systems like ChatGPT have improved human-computer interaction, and explores the potential of multimodal input and output methods, including voice, gesture, and visual UI, to create more immersive and interactive experiences. It highlights the use of LLMs, pose detection, and other technologies to enable more natural and adaptive interactions.

Full Transcript

[Music]

Hi everybody, thanks for having us here today. We're super excited to be here. I'm Sam, and I'm one of the co-founders of New Computer. And I'm Jason, the other co-founder, and we're really excited that we are starting today by letting you all see our pores up close, which is amazing.

So, you know, when Sam and I started New Computer, we did so because we believed that for so long we've taken certain metaphors and abstractions and tools for granted, and for the first time in what feels like 40 years we can finally change all of that. We can start thinking from first principles about what our relationship, not only with computing but with intelligence, period, should look like in the future.

So what do we mean by intelligence? Because, you know, sometimes I'm on the internet and I wonder if it even exists. Well, one way to think about intelligence is the ability to take in lots of information, of different types, in different volumes, from different sources (visualized as dots here), and find ways to make sense of it all: find ways to reason, find ways to find meaning. As human beings, as carbon-based life forms, we do this through a process where first we use our senses to perceive the world around us, then we process that information in our heads, and then, given what we think, we choose a reaction.

If we're lucky, we are blessed with at least five senses, six when I've had four margaritas. But as humans we are inherently capable of processing all of this at the same time, and that actually is how our short-term memory gets to work. Taking all this context and information, we then get to form what's called a theory of mind: what is going on, how is the world relating to me right now, what should I be doing about it? So we sense, we think, and then we react.

And how do we react? Well, there's a lot of things right now, but if we take it way back to the Stone Age and we think real
simple, a lot of how people used to react and communicate is just unintelligible grunts. And then one day that sort of evolved into language as we know it, and to this day that's still something that we rely on to communicate and react to the world around us. That's also how a lot of us think.

So we have language, but the language of communication is so much broader than just language. We're standing here on stage right now. I'm making eye contact with some of you (nice shirt), and I'm making gestures. I'm wearing these ridiculous gloves. I'm looking at Sam, I'm looking at things, I'm pointing at things. I can hear laughter, or I can hear people thinking. I'm taking in lots of information at once, and right now I'm sensing, thinking, and reacting.

So this year, well, last year technically, we saw a really amazing thing happen, with the advent of ChatGPT I would say, where we saw the beginnings of a computer starting to approximate that same loop: input was coming in in the form of language, there was some reasoning process, however that actually works, and then the output also felt like language coming back to us. This was very inspiring to me and Jason, and we've been spending a lot of time this past year thinking about what's next and how it gets to feel even more natural for people to interact with computers specifically.

So today we wanted to take you on a tour of a few demos: one which you can do with a computer right now, and then a few which use futuristic or next-generation hardware that may be available soon. Knowing that you're all engineers, we know that this will get the sparks flowing, the ideas flowing, for seeing how you might use some of these things that are coming out soon, or things that exist today, to build things that feel more natural. So I'll start by getting to a demo, and I will say this is a live audiovisual demo, so I am foolish
enough to make that choice, so we will see how it goes. Before we show any demos, it's prudent to point out that none of these represent the product we are building. They are simply, yes, pieces, stories of inspiration.

So the point of this first demo is to imagine... we have a lot of things where we're saying, okay, is text the right input? Is audio the right input? And we've been thinking it's not if those are the right things, but when. So in this case you'll see some measurements happening on the left here. What's actually happening is that this has access to my camera and it's taking real-time pose measurements of where I am relative to the screen. It knows I'm at the keyboard, basically, because it's making that assessment, and you can see the reasoning on the side here, where it's saying "user is close to screen, will use keyboard input; user is facing screen, will use text output." We're using an LLM to actually make that choice as it goes to the response.

So let's try something else, and again, demo gods be nice, because this may not work at all. If I now walk away and it doesn't detect me anymore, it should actually start listening to me. Hello, can you hear me? Are you going to respond? I think that's a no. It might not respond, but basically what we are attempting to build here is: if I want to talk to the computer in a really natural way, then if I'm there next to the keyboard, it should not be paying attention to my voice or any ambient sounds, and if I walk away from the keyboard, I might want to have a conversation with it, like walk around the room. It is listening. It seems to have decided not to actually talk back, but... oh, it's talking.

"Is there something you need help..." "Sounds like an interesting project, Samantha. How is your talk going so far?"

Yay! [Music] [Applause] Yes, you can see it paid attention and it decided to ignore me for a while. But anyway, this is just a toy demo.
You can see here how it's working behind the scenes: it's trying to decide if I'm close to the keyboard, facing the screen, or not facing the screen, and it uses all of that as inputs to decide whether it should talk to me or just display the text on the interface. Cool.

So the reason why we think this is interesting is because people are naturally sensitive to other people, and we think computers, instead of asking people to adapt to computers, to come up to me and type and whatever, should find ways to adapt to the circumstances and context of people. Exactly. So again, in this case it's adapting to where I am by using pose detection and whether or not I'm actually in the process of talking to it to update its own world state, using an LLM to actually do that, and then using the LLM to respond with the knowledge of that world state. This is a really simple and, as you can see, kind of hacky demo of something you could build today. In theory, you could imagine how this could be a really cool native way to interact with an LLM on your computer, where you don't have to worry about the input at all.

So again, the takeaways are: consider explicit inputs, what I'm typing, what I'm saying, along with implicit ones, like where I am. There are other things you could do with that, like tone and emotion detection. You could plug in a whole bunch of different signals that you want to extract. And you can even imagine, if I'm in the frame with Sam, and the agent knows Sam, and she had recently been complaining about me, it should probably not bring that up until I leave. Yeah. And, as we mentioned, using it as a reasoning engine. Next one. Cool. And then we're adapting.

So we want to get to the futuristic stuff. Jason has been spending a lot of time imagining this, so he's going to walk you through a few things that might exist shortly, in the near future, when
new hardware comes out.

So, in the future, we still think the sense, think, react loop will take place. To preface all of this: these are my personal speculative views, not representative of anything that I think might actually happen, and this is a very conservative view of the next 1 to 12 months maybe, so it's not a true future-future, AGI, god-worshipping type situation.

Let's start with what I call a social interface. We're all really excited about certain headsets being released at certain points, and one thing that I think is interesting about some headsets is that they have sensors, and they have hand tracking and eye tracking. Just like how I'm being expressive right now, maybe there comes a day where I can be such with a computer that sort of lives with me. So here I am in my apartment, minding my own business, and my ex decides to FaceTime me, and now I've declined the call. Historically, with deterministic interfaces, I would have had to find the hang-up button or go, "Hey Alexa, decline call," thinking in commands, thinking in computer-speak. But as a person I can be like "off," you know, I can be like "I'm busy," I can be like "I'm sick." All this stuff the computer should be able to interpret for me and send, what's his name again, Toxic Trashiest, whatever, on his merry way.

So explicit social gestures can be a great way to determine user intent, like the way I just showed. But we should also consider interpreting implicit gestures: whether I give a really fast gesture versus a slow gesture, my mood, my tone, how far away I am. We should also be conscious of social and cultural norms. Different gestures mean different things in different societies, and as you scale your application or hardware to different locales, this is something that you should pay attention to.

Now I want to move on to talk about what I call new physics, and this part
is super fun. This demo is based on the iPad, which, you know, has over five daily active users in the world. It's very popular. And here I'm imagining, okay, Midjourney: if I was the founder of Midjourney, I would be putting all my resources into making some sort of Midjourney canvas app for iPad. So in this one I've asked Midjourney to create Balenciaga Naruto, which now I'm realizing kind of looks like me.

So let's think about the iPad. It's this big slab that you can touch and fiddle with, right? So what do I want to do? Okay, I want to edit this photo, but first I need to make space. How do I do that? Well, very easy: you can just zoom out, and now you have extra space. Very obvious, we do this all the time. I kind of think my cat would look really good in that outfit, so I want to find a way to do that here. Let me just ask AI real quick: "Hey, random AI, send me pictures of my cat." And the AI knows me and has context and gives me pictures of my cat. And then what do I do here? Well, why can't we just take one of the photos and blend it with the other?

And the metaphor you're seeing here: as you work with these photos, they start glowing when you pick them up. And what does light... you guys know the Pink Floyd Dark Side of the Moon album cover? We're really familiar with the idea that light can provide different colors and concentrate back into one form, and we're leaning into that metaphor here implicitly. So it's now created something that looks 50% human, 50% cat, 100% cringe. I don't really like this. How do we remix this? What is a gesture, what is the thing we do in real life, that's remixing? For me it's a margarita, and for Sam it's her morning Huel: we shake a blender bottle. So why can't we work with intelligent materials the same way that we work with real materials, and just blend it out? This is totally doable right now. David, why aren't
you building this? If you don't build this, I'm going to build this. It's fine.

So here, what we're trying to say is: think about familiar, universal metaphors, like physics, like light, like metaballs, like squishy, like fog, whatever. Because if you're designing an iPhone, you have to be very cognizant of the qualities of aluminium and titanium, but generative intelligence is a probabilistic material that's more fluid. Maybe it's fog, maybe it's mercury. For this reason, maybe metaphors that are really rigid, like wood or paper or metal, aren't the right metaphors to use for some of these experiences.

So finally, we want to walk you through an experience that's inherently mixed-modal, or mixed-reality. Let's imagine for a second there's a piece of hardware coming out that's a wearable, that has a camera on it and a microphone, and can maybe project things. I don't know if such a thing will ever exist, but let's imagine for a second it does. I'm browsing this book, this Beyonce tour book, and I see these images that I find really inspiring. What I'm trying to do here is: what if I could just point at something on my desk and say, "This is cool," and have the device pick up on that and indicate that it's heard me, and that it's going to do something, by projection-mapping this sort of feedback? This demo doesn't really have sound, but the way this would work is ideally a combination of voice and gesture at the same time.

And obviously this gesture is really easy to make mistakes with, so anytime you work with probabilistic materials, you want to provide a graceful way out. In this case, I've accidentally tapped this photo. Why can't I just flick it away like dust and be like, "That's wrong"? I don't want to press an undo button, I don't want to press Command-Z, I just want to flick it away, really leaning into the physics of it. So now that I
found two pieces, I'm kind of like, okay, I want to send this to two of my friends... there was a friend who I said I would do Halloween with, but I can't remember their name. What do I do here? I should ask AI. I should be like, "Who is that friend I said I'd spend Halloween with?" And you'll notice here that we're imagining projection-mapped UI pieces that can work with the context of the world you're in right now, such that you don't have to go fish out a phone or use cumbersome voice commands. It all just sort of naturally melds with the world.

And crucially, one point we want to make is that voice in doesn't need to mean voice out, gesture in doesn't need to mean gesture out, and visual UI in does not need to mean visual UI out. We can mix these modalities in real time for whatever makes sense in whatever context you're in. So given that interactions that require multiple simultaneous inputs are now possible, it's our job as designers and developers to think on behalf of the user, think about what the appropriate output is given the current context, and be smart about it. Yeah.

Yeah, so again, the takeaways, as we mentioned: it's this idea that we have a lot of sensors and contextual modalities available to us as ingredients even today. There will be more tomorrow, as you saw with these upcoming potential hardware releases, but even now with a laptop, with things like typing speed, with things like the tone of voice, there's a lot of ways that you could gather context and extract signals from it. You could choose to process it in a variety of different ways, and all of that can now be passed to an LLM and used in a reasoning layer, which decides both how to respond in words and how to present that information. So basically everything can now be an input, and your output could be everywhere and have every format, at the same time. One might say everything, everywhere, all at once. Well, you want to be intentional with
it. You know, if someone wants to generate a photo on their Apple Watch, you're like, why? No, use your freaking phone, Jesus. Anyway.

The last thing we'll say is that probabilistic interfaces are hard because they have lots of different outputs, so a really great way to ground these interfaces is to lean into familiar metaphors, whether they are from nature, from physics, or even from human-made tools and materials, like buttons, for now. And social norms are also a material that we work with, right? So your banking AI agent probably shouldn't be able to have a deep philosophical chat with you. That just socially doesn't make sense. Exactly.

But on the same note, we've related all these interfaces to what humans perceive and experience now, but what might a truly intelligent interface look like in the future? If we think where we are right now is isomorphism, what is the abstraction layer above that? That's kind of for us to figure out. So with that, yeah, I think that's all. Thank you. [Applause]
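
The first demo in the talk (pose measurements feed a world state, and that state decides whether to use keyboard or voice input and text or speech output) can be sketched roughly as follows. This is a minimal illustration, not the speakers' implementation: the names `WorldState`, `ModalityDecision`, `build_prompt`, and `rule_based_decider` are all hypothetical, and the rule-based function is a deterministic stand-in for the LLM call the speakers describe, so that the sketch runs without a model or a camera.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WorldState:
    """Implicit context extracted from the camera (fields are illustrative)."""
    user_detected: bool
    near_keyboard: bool
    facing_screen: bool


@dataclass
class ModalityDecision:
    input_mode: str   # "keyboard" or "voice"
    output_mode: str  # "text" or "speech"
    reasoning: str    # human-readable trace, like the side panel in the demo


def build_prompt(state: WorldState) -> str:
    """Describe the world state in words so a language model could reason over it."""
    return (
        "Decide how to interact with the user.\n"
        f"user_detected={state.user_detected}, "
        f"near_keyboard={state.near_keyboard}, "
        f"facing_screen={state.facing_screen}\n"
        "Answer with an input mode (keyboard|voice) and an output mode (text|speech)."
    )


def rule_based_decider(state: WorldState) -> ModalityDecision:
    """Stand-in for the LLM: assumed rules that mimic the behaviour shown on stage."""
    if state.user_detected and state.near_keyboard:
        input_mode = "keyboard"
        why_in = "user is close to screen, will use keyboard input"
    else:
        input_mode = "voice"
        why_in = "user is away from keyboard, will listen for voice"
    if state.user_detected and state.facing_screen:
        output_mode = "text"
        why_out = "user is facing screen, will use text output"
    else:
        output_mode = "speech"
        why_out = "user is not facing screen, will use speech output"
    return ModalityDecision(input_mode, output_mode, f"{why_in}; {why_out}")


def decide(
    state: WorldState,
    decider: Callable[[WorldState], ModalityDecision] = rule_based_decider,
) -> ModalityDecision:
    """Run one sense -> think -> react step; swap `decider` for a real model call."""
    return decider(state)


if __name__ == "__main__":
    at_desk = WorldState(user_detected=True, near_keyboard=True, facing_screen=True)
    walked_away = WorldState(user_detected=False, near_keyboard=False, facing_screen=False)
    for state in (at_desk, walked_away):
        decision = decide(state)
        print(decision.input_mode, decision.output_mode, "|", decision.reasoning)
```

In a real build, `decide` would presumably be given a function that sends `build_prompt(state)` to a model and parses the reply. Keeping the decider pluggable makes the modality logic testable without the camera or the model in the loop.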

Original Description

ChatGPT was a turning point for consumer adoption of AI due to its easy-to-use interface. Just by changing some elements of design, interaction, and behavior, an existing model suddenly 'clicked' in terms of its utility for everyday people. What might be the next leap forward for making AI-driven applications even more accessible & intuitive? Join Sam & Jason as they showcase various demos of novel interaction & behavior paradigms for AI-driven applications.

Recorded live in San Francisco at the AI Engineer Summit 2023. See the full schedule of talks at https://ai.engineer/summit/schedule & join us at the AI Engineer World's Fair in 2024! Get your tickets today at https://ai.engineer/worlds-fair

About Samantha Whitmore

Former Head of Engineering at Kensho, a startup which used early NLP techniques to organize information for financial clients, including Goldman Sachs, BAML, and JPMC. Kensho was acquired in 2018 by S&P Global for $550mm, at the time the largest AI acquisition in history. Subsequently was Head of Engineering at Maximus, a startup that partnered with IMAX to build video super-resolution software. Recently was one of the early core contributors to LangChain (pioneered the implementation of Memory).

About Jason Yuan

Former member of Apple Design Team where he worked on the future of computing and artificial intelligence. Founder and co-inventor of MakeSpace (now known as Sprout), a multi-player-first video conferencing platform. Creator of mercuryos.com and helped pioneer ideas in generative interfaces. Worked on projects with culture makers like Blackpink, Chanel, Vogue, Jackson Wang, The MET Gala, Nike, Christina Aguilera, FKA Twigs and The Weeknd.

The video discusses the evolution of intelligent interfaces and explores the potential of multimodal input and output methods to create more immersive and interactive experiences. It highlights the use of LLMs and other technologies to enable more natural and adaptive interactions. By understanding the concepts and techniques presented in the video, viewers can build intelligent interfaces using LLMs and design adaptive AI systems.

Key Takeaways

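Use an LLM as a reasoning layer that decides both how to respond and how to present the response
Adapt to the user's context and circumstances using pose detection and other sensing
Combine explicit inputs (typing, speech) with implicit signals (position, tone, typing speed) to infer intent
Treat generated media as a fluid material: the iPad demo blends a cat photo with a generated portrait and remixes the result with a shake gesture
Combine voice and gesture input to control a device, and give users a graceful way out when a gesture is misread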
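
The voice-plus-gesture idea from the wearable demo, including the "flick it away" escape hatch, might look something like the sketch below. Everything here is an assumption made for illustration: the event types, the gesture labels `"point"` and `"flick"`, the `Selection` class, and the 1.5-second pairing window do not come from the talk, which only describes the behaviour.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GestureEvent:
    kind: str         # hypothetical labels: "point" or "flick"
    target: str       # identifier of the object the gesture refers to
    timestamp: float  # seconds


@dataclass
class VoiceEvent:
    text: str
    timestamp: float  # seconds


@dataclass
class Selection:
    """Items the user has picked by pointing while speaking."""
    items: List[str] = field(default_factory=list)

    def apply(
        self,
        gesture: GestureEvent,
        voice: Optional[VoiceEvent] = None,
        window: float = 1.5,  # assumed max gap between gesture and speech
    ) -> str:
        # A flick is the graceful way out: it removes a mistaken selection
        # without an undo button or a keyboard shortcut.
        if gesture.kind == "flick":
            if gesture.target in self.items:
                self.items.remove(gesture.target)
                return "dismissed"
            return "ignored"

        # A point only counts when speech arrives close enough in time,
        # so a stray hand movement alone does not select anything.
        if (
            gesture.kind == "point"
            and voice is not None
            and abs(voice.timestamp - gesture.timestamp) <= window
        ):
            if gesture.target not in self.items:
                self.items.append(gesture.target)
            return "selected"

        return "ignored"


if __name__ == "__main__":
    selection = Selection()
    selection.apply(GestureEvent("point", "photo_1", 10.0), VoiceEvent("this is cool", 10.4))
    selection.apply(GestureEvent("point", "photo_3", 30.0), VoiceEvent("this is cool", 30.2))
    selection.apply(GestureEvent("flick", "photo_3", 31.0))
    print(selection.items)
```

Requiring both signals inside a short window is one simple way to lower false positives from a probabilistic gesture recognizer. The flick path keeps recovery as cheap as the mistake.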

💡 Probabilistic interfaces can be grounded by leaning into familiar metaphors from nature, physics, or human-made tools and materials.
