Release Notes: Gemini's multimodality
Key Takeaways
Explores Gemini's multimodal capabilities for building proactive AI assistants
Full Transcript
Gemini from the beginning was built to be a multimodal model. If we want to build AGI and powerful AI systems that can perform these general human tasks, vision is a core component of the human experience. These models should be able to see and perceive the world like we do. I think vision still feels like one of the areas where there's the biggest gap between the model capability and the products that people are building. I think we're very early in that world because it takes time for people to build intuition for what these models can do. This proactivity bit is what I'm most excited about. We see a world where everything is vision and these models can see your screen, see the world just like we can, but they're also domain experts in every field, which I think is a future that I'm super excited by. [Music] Hey everyone, welcome back to Release Notes. Today we're chatting with Ani Baddepudi, who's the multimodal vision product lead for Gemini and also newly the product lead for Gemini model behavior. So thanks for coming on to talk all things Gemini multimodal. Great to be here. Thanks for having me. I think Gemini from the beginning was built to be a multimodal model. What does that mean in practice? Why was that the case, actually going back to Gemini 1.0, when we sort of planted the flag that we were going to build a language model that was multimodal? Can you share that context? Yeah, totally. So DeepMind has been working on multimodal capabilities for a long time, and the reason for this is that if we want to build AGI and powerful AI systems that can perform these general human tasks, vision is a core component of the human experience. So tasks in various domains like medicine, finance and so on have a strong visual component.
So the vision with Gemini from the Gemini 1.0 days was to have a model that can see and perceive the world like people can, which enables these models to perform these tasks in that manner. I think about this all the time, because if you look at AI products, in a lot of ways they're sort of screaming to be multimodal, and there are so many weird product experiences you have to build if you're just building for this text world. In a lot of ways the solution is making it multimodal and showing the model something visually. We've pushed on the multimodal Live API thread, and that's a great example of this playing out, where people can actually go and build that stuff. Back to this thread of Gemini being multimodal: what does it actually mean from a model perspective for it to be a quote-unquote natively multimodal model? Yeah. So we have a single model that's trained to be multimodal from the ground up. At a high level, what this means is that text, images, video, audio, all these modalities are turned into a token representation, and the model is trained on all this information together. What this results in is a model that can understand not just text but text with images and audio and video and so on. So the abstraction I like to use for these models is that they should be able to see and perceive the world like we do, and that's the goal behind training these models to be natively multimodal, trained this way from the ground up. Yeah. Well, do you get a compression loss effect when you do that?
I think of tokens as just numbers under the hood, and then I think of an image, and, you know, a picture is worth a thousand words. But it also feels at the same time that the models are really good at multimodal. So is there some magic happening there that makes it so that you don't lose all of the nuance of things? Yeah, totally. So a couple of things. The first is that losing information from images is a big research problem. When we turn images into token representations, we inherently lose some information from the image. This is a constant research question: how do we make our image representations less lossy? The second is that when we extrapolate to video, we sample videos at one frame per second. During training there are other tricks we can use, but we lose information because the model's not seeing the entire video stream. So there is some information loss. I think the thing that's really surprising is that these models generalize pretty well. Once the model sees enough images and sees videos, even if they're sampled at one frame per second, these capabilities generalize pretty well, and it's kind of mind-blowing what these models are then able to do. So yeah, these are constant research questions that we're working on. Yeah. For folks listening to this, if they haven't seen it, we put out a blog post about Gemini 2.5 Pro having state-of-the-art performance on video understanding. It's a very visual use case, so go read the blog post, because there's a ton of really great visual use cases and applets that we built in AI Studio and elsewhere to showcase this capability. But how much is the video work related to the image work?
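To make the one-frame-per-second sampling mentioned here concrete, the following is a minimal sketch of which frames of a clip would survive such sampling. The function name, the nearest-frame rounding, and the idea of working with native frame indices are all illustrative assumptions; the transcript only states that videos are sampled at one frame per second, not how.

```python
def sample_frame_indices(duration_s: float, native_fps: float,
                         sample_fps: float = 1.0) -> list[int]:
    """Return the native-frame indices kept when sampling at sample_fps.

    Hypothetical helper: picks the native frame nearest to each sample time.
    """
    if duration_s < 0 or native_fps <= 0 or sample_fps <= 0:
        raise ValueError("duration must be >= 0 and frame rates must be > 0")
    total_frames = int(duration_s * native_fps)
    if total_frames == 0:
        return []
    n_samples = int(duration_s * sample_fps)
    indices = []
    for k in range(n_samples):
        t = k / sample_fps  # sample time in seconds
        # Clamp so a rounded index never runs past the last native frame.
        indices.append(min(round(t * native_fps), total_frames - 1))
    return indices


if __name__ == "__main__":
    kept = sample_frame_indices(duration_s=3, native_fps=30)
    print(kept)                      # indices of the kept frames
    print(len(kept) / (3 * 30))      # fraction of frames the model would see
```

For a 30 FPS source, this keeps one frame in thirty, which is the kind of information loss the speaker describes.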
Is there a lot more complexity from a model perspective in making videos work well, or is it actually just passing a bunch of images behind the scenes? Yeah, so video with the 2.5 models is pretty mind-blowing, and I'd say there are a few things. The first is that previous Gemini models were pretty good at video, but robustness was a bit of an issue. One of the issues we had, for example, was that if you fed an hour-long video to a model, the model would focus on the first five or ten minutes and then trail off for the rest of the video. So there are some of these quality aspects that the team has worked a ton on, and these are very video specific, especially long-context video, which is really the upshot of these capabilities. The second is just core vision improvements, and that generalizes to video as well. One really cool example that we highlighted in the blog post is the ability to turn videos into code, and this enables a ton of cool things. You can turn videos into animations. You can turn videos into websites. I fed in a YouTube video of a recipe and turned that into a step-by-step recipe. A use case that we see people using a lot is videos of lectures, turning those into lecture web pages and lecture notes and stuff. Makes college sound fun. Yeah. Just take a bunch of boring lectures and feed them to AI, and it makes it all an interactive, custom learning experience. Exactly. Yeah. And these things turn into interactive apps that you can learn with. So I think the really cool thing about Gemini 2.5 is that it unlocks video as a medium of information to do really useful things. So yeah, I'd say it's a bit of both.
Ani, you and I talk a bunch about how there are actually so many different vision use cases bundled under this multimodal umbrella: all the OCR stuff, video understanding, probably 50 other things that I don't even know about or think about that often. How do you think about this from the multimodal product side of Gemini, and what's the relationship and the interplay between all these capabilities? Are they independent of each other? Do you see gains across all of them when one gets better? Yeah, two things. The first is that the pro of having a single multimodal model is that we see a ton of positive capability transfer. One of the cool things about this 2.5 launch is that things like video to code work really well because the 2.5 models are just a lot stronger at code. The second is that even within vision we see a ton of capability transfer. In the past you would have had separate models for vision capabilities: a separate OCR model, a separate detection or segmentation model, and so on. The cool thing now is that all of this is bundled into Gemini, and that results in a ton of cool use cases. For example, say I'm transcribing a video. That requires strong OCR, but it also requires strong temporal understanding for the model to be able to understand what happens in the video and then transcribe it. One use case that we're super excited about is using Gemini as a pair programmer. So we stream in a video of your IDE to Gemini, you ask it questions about your codebase, get answers, and so on.
And this is a use case that requires strong coding capabilities and strong core vision, which is spatial understanding and OCR, but then also the ability to understand a video and information in a video across time horizons, which is that temporal reasoning piece. I love this use case. My bet is, if we look a year from now, every developer product has something like that, and actually maybe even more generic than that: the OS is going to have something like that, and different products are going to have custom versions of this, because it's so powerful. We've seen this already from a customer traction standpoint, with people building with the Live API. There are just so many cool, not run-of-the-mill AI applications being built, which is exciting to see, because it feels like there are lots of people doing the same stuff, and it's awesome when people go and build new things. I'm also super curious about how you think about what to focus on from a model capability standpoint, versus, to your point, as the base model gets better, the rising tide lifts all the ships, or whatever the expression is. Are there certain areas where you're like, we don't really need to make that bet from a multimodal perspective because it's just going to happen naturally? Or are you having to explicitly track that to see, oh no, here we need to focus on this because it's not getting better as the base model improves? Yeah, this is a great question. I'd like to split it out into three portions. The first are use cases that we see are critical today for users and customers. So folks using the APIs, like developers, and Google products that use Gemini for multimodal vision use cases.
So these are things that feel like short-term capabilities that we need to make Gemini super strong at. The second piece, which I think is very critical, are some of these long-term aspirational capabilities. These are things that people aren't asking Gemini to be able to do today but that we think are very critical for building powerful AI systems. One of the cool examples is visual reasoning, and we see early signs of this with the Gemini 2.5 models: this ability to reason over pixels. We have a bunch of toy examples. Say you have a pinball with a bunch of surfaces, and you ask Gemini about the path that the ball would take and which bucket it would fall into. That's a capability that's super interesting because the model isn't reasoning over text, but actually needs to reason over the image and understand what the trajectory of the ball would be in the image. That's a very simple toy example, but you can extrapolate to a future where this becomes critical for things like robotics. If robots and self-driving cars have AI systems like Gemini powering embodied reasoning, that unlocks a ton of use cases. So these are things that customers aren't asking us for today, but the team is super excited by them and we think they are very important for building AGI. The third, like you say, are things that we get surprised by, things we don't really plan for. And this happens just from scaling our models up. 2.5 was a great example of this. We didn't plan particularly for these models to be this amazing at image to code and video to code, but this turned out to be a super strong capability with 2.5.
And I think the key is, when we see early signs of this happening, figuring out what the use cases are and where these things can be really powerful. An example of this is UX to code. I think the workflows of designers and product managers change completely with these capabilities. Now I can sketch a UX, feed that into Gemini, and it generates a pretty good prototype using HTML or JavaScript and React for that interface. I think these types of things are super cool and are capabilities that we get surprised by. This makes being a PM way more fun. Through this lens of stuff that you and I spend a bunch of time talking about, one of the bits is just: what are people actually building with multimodal, and how do we help the people who are building interesting stuff now, but also try to convince builders and startup founders: here are all the next things you could be building, here are the good ideas, here are the capabilities. Through that lens, what is some of the stuff that you're most excited about from a vision standpoint, and the product experiences people could be building with this stuff, versus what I think they're building now, which is not that much? Some interesting stuff, but there's lots more capability to pull out of the model still, I think. Yeah. So I think we're starting to see people do some really cool things with vision, and some of this just happens as the models get strong enough to make these things work. I like to think of this in three buckets. The first are use cases that existing models or systems were able to do. These are things like traditional OCR, translation, and image retrieval.
So Google Lens does this really well for things like shopping, like find a similar sweater, and things like classification, like help me identify this plant or this animal, and so on. I think we're seeing a lot of usage in this sphere because people are used to using existing vision models for these things, and Gemini is a single model that's able to do all of them. I think it starts to get more interesting when we look at the second and third. The second set of use cases are tasks that a human could do, or, say you had an expert in a given domain with you, tasks that they could do. For example, we travel to London a bunch for work, and something I've really enjoyed doing is taking Gemini out, walking around the city, and asking questions about things around me. Previously, I would have had to figure out a question in text to ask Google and get a response. But now I have a completely lossless way of asking these questions using vision. Another cool use case that I was trying the other day: I had a Google Doc with a bunch of comments, and I took a screenshot of the doc with the comments, fed that into Gemini, and said, "Hey, help me rewrite this doc while answering these comments." Gemini did a pretty good job. I think 50% of the comments were addressed perfectly, 30% were pretty good and needed minor tweaks, and 20% I had to rewrite. But if we extrapolate this, we see a world where everything is vision and these models can see your screen, see the world just like we can, but they're also domain experts in every field, which I think is a future that I'm super excited by.
The third set are use cases that I think of as beyond human, or beyond tasks that humans could do in a feasible amount of time. These are things like being able to watch a six-hour-long video and find specific moments where things happen. Say you feed in a very long sports game and generate a highlights reel. That takes a lot of time for a human to do. Or generating fine-grained segmentation masks on an image, which is hard for humans to do. Another example is some of the video to code things, where you have a video and you can turn that into an interactive learning application. These are things that would take people a long time to do, but you can just zero-shot Gemini with them. And I think we're very early in that world, because it takes time for people to build intuition for what these models can do, and it also takes time for us to build the interfaces for people to do these things smoothly. But I'm really excited for a world where we can really tackle the second and third piece. How different do you think those products tackling the second and third bit look from today's bar? Imagine I have an AI chat app today, and I want to fully embrace this world of, you know, everything is vision. I love that mantra. I'm going to have shirts made that say "everything is vision." What's the delta? What do folks actually have to do if they want to buy into that road? I do think builders are trying to find the edge in today's world, and you and I have had this conversation a bunch of times: there are so many interesting edges for people building in vision right now, just because there aren't that many products in the space.
So, I don't know, do you have any advice on how to do the exploration to find the experience? Yeah. So something that I really like doing is anthropomorphizing these models as much as I can: thinking of these models as expert humans at a given task, and designing the interface around the way a human would perform that task. I think as an industry we defaulted to chat as an interface primarily because humans are very used to using chat. We message all the time, we use search for retrieval, and so on. But I actually think some of these human modes of communication and interaction are far more natural. Part of this is models getting good enough to do these things, and I think we're getting there. So the thing that I really think about is, can we make these models feel as natural as possible? The world was built for humans, so I think it makes sense to build these machines and these systems in the same way. But yeah, there's still some work to get there. A vision of the future that I'm really excited by: today, most AI products are turn-based. You query the model, or even a system, you get back an answer, you query the model again, you get back an answer, and you repeat that process. In this view of the world where everything is vision, the products I'm super excited by have a bidirectional audio-video interface for interacting with AI systems. This results in some cool things: your model can understand audio and video just like a human would, and it can be proactive, so based on visual cues it can suggest things for you to do.
This proactivity bit is what I'm most excited about, because I think there are so many use cases where, if the model can see my screen on my computer, I could ask the model to do something, but I don't really want to. I'd like to just write, here are the things you could do for me, take action on these whenever this thing happens on the screen. Say I get an error in my terminal: I kind of want the model to just go and find a bunch of stuff and give me suggestions to fix it, without me having to actually talk to the model. There are so many interesting product directions, and it actually feels like it's not that complicated to build, which is also really interesting. It's just the Live API with a bunch of the stuff that we just shipped in it, and out of the box it can do a lot. Exactly. So the crazy thing is these models are pretty good at these things today. One way that I like to think of what new products could look like is, imagine you had an expert human looking over your shoulder, seeing what you can see, and helping you with things. I think the form factor where this works today is screen share, because you can feed your screen into a model and you're looking at your screen to perform tasks. One example that I've used Gemini Live for: I was cooking, and previously I would have had to follow a step-by-step recipe and try to pattern match what I'm doing to the recipe, and more often than not it doesn't turn out exactly like what's in the recipe. Something cool that Gemini can do is look at what you're doing as you're doing it and then proactively, based on visual cues in the video, suggest things to do. So, I don't know, I was boiling pasta.
It was like, "Hey, add the pasta now," and things like that. Were you just holding up your phone to do this? This is why we need glasses. Exactly. Exactly. So some other mechanism, you know. Yeah. So I think the core problem is developing the interfaces. I think we've moved towards this world of glasses, and we're working on these things at Google as well, so that could be one way that we do this. But I think there might be others too. The phone isn't great because you lose some amount of mobility, so it doesn't feel as natural. But not a necklace. Exactly. Yeah. Some people have tried necklaces as well. So thinking of these systems as being able to look over your shoulder, see what you see, and help you with things in the real world, I think it's very powerful. The question is how we build interfaces to actually enable this in practice. Another piece related to this discussion that I think is very exciting is, along with proactivity, these models being able to, at a high level, multitask. We have thinking models now. Imagine if I could talk to the model, it has vision, so it can either see me or see what I see, and it can think at the same time while I'm talking to it. So it's able to take in audio and video but then also think at the same time in some form. Or we have things like Project Mariner, where Gemini can actuate on a screen and perform actions. I think a cool world is one where, while I'm talking to Gemini, it's doing things on my screen, providing me with feedback, and things like that. This is the thing that I'm really excited about, and what I think a lot about is how we can make these models feel as human, or even beyond that, as superhuman as possible. Yeah.
And thinking of the interfaces as being as close to that as we can make them. I love that. Ani, we talked before about how the model actually understands, on the back end, what an image looks like from a token perspective. How does that happen on the video side? What's the delta between a video understanding use case and an image understanding use case behind the scenes? Yeah. So firstly, Gemini is one of the only foundation models that can take in video, and it's state-of-the-art at video understanding and reasoning over videos. For Gemini to be able to understand video, it needs to be able to understand both the audio component and the visual component. This is a pretty tricky problem to solve because you need these things to line up. The way this happens today is we interleave audio and the frames that correspond to that audio at each given time chunk. And what's really remarkable is that this generalizes pretty well, so the model is able to understand videos pretty well using this approach, and it feels pretty natural. On this FPS conversation, and we've been kicking around a bunch of threads on this for a while: I don't have a good intuition as to why, at the model level, we have to do something special to support different FPS values. Why can we not just grab more images? Does it make the audio bit kind of garbled, or is there just less context attached to each image, so you lose reasoning capability as it processes? Why is it hard to add FPS functionality? Yeah, a couple of things. The first is that part of this is just a function of design. It was something that seemed to work pretty well, and one FPS did a pretty good job. That being said, there are a bunch of use cases that having higher frame sampling helps a ton for.
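As a rough illustration of the interleaving described here, the sketch below pairs each time chunk's frame with the audio from the same chunk in one flat sequence. The function name, the string placeholders standing in for tokens, the one-second chunk size, and the frame-before-audio ordering are all assumptions made for the example; the transcript says only that audio and the corresponding frames are interleaved per time chunk.

```python
def interleave_by_time(frames: list[str],
                       audio_chunks: list[str]) -> list[tuple[int, str, str]]:
    """Build one sequence where chunk t's frame sits next to chunk t's audio.

    Hypothetical layout: (chunk_index, modality, payload) entries, with the
    frame placed before the audio for each chunk.
    """
    if len(frames) != len(audio_chunks):
        raise ValueError("need exactly one audio chunk per sampled frame")
    sequence = []
    for t, (frame, audio) in enumerate(zip(frames, audio_chunks)):
        sequence.append((t, "frame", frame))
        sequence.append((t, "audio", audio))
    return sequence


if __name__ == "__main__":
    # Placeholders for a 3-second clip sampled at one frame per second.
    frames = ["frame_0", "frame_1", "frame_2"]
    audio = ["audio_0", "audio_1", "audio_2"]
    for entry in interleave_by_time(frames, audio):
        print(entry)
```

The point of the layout is that content from the same moment in time stays adjacent in the sequence, which is one way the "lining up" of audio and video could be expressed.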
So we've seen people come to Gemini to do things like feed in your golf swing and have Gemini rate your golf swing, or critique your dance moves. For these types of things, having higher FPS is super powerful, and this is something that we're working on. We actually saw that this was a real need when we saw customers start to slow down videos. Folks would want, let's say, 5 FPS, so they'd slow down their videos by 5x to be able to support this. Part of the reason for 1 FPS is just the way we designed Gemini: our tokenization, sampling at 1 FPS, supported around an hour of video, so it was a pretty clean video length to support with these models. That being said, we've now released more efficient tokenization, so these models can do up to six hours of video with a two-million-token context. This is with lower detail too? Yeah, exactly. Lower detail, but performance is surprisingly high. We represent each frame with 64 tokens instead of 256 previously. And what does that mean? Is there just a less verbose description of what's happening? You know, we're sitting in a library. If you took a picture of what's behind us and used the 64-token representation, would you just see fewer titles of the books? And maybe this is a noisy example, so it's harder to make. Yeah. So this is a very abstract idea. What we've actually seen is that as our tokenization methods have become stronger, we need fewer tokens to represent frames in a video. Say with Gemini 1.0: back then, representing an image with 64 tokens was just a very lossy representation.
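The numbers quoted here (1 FPS, 256 versus 64 tokens per frame, one hour versus six hours, a two-million-token context, and the 5x slow-down trick) can be checked with some back-of-the-envelope arithmetic. This is a sketch only: the helper names are made up, and it counts visual tokens alone, ignoring audio and text tokens, which the transcript does not quantify.

```python
def visual_tokens(duration_s: float, fps: float = 1.0,
                  tokens_per_frame: int = 256) -> int:
    """Visual-token cost of a video: sampled frames times tokens per frame."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame


def effective_fps(sample_fps: float, slowdown: float) -> float:
    """Samples per second of ORIGINAL motion after slowing the video down."""
    return sample_fps * slowdown


HOUR_S = 3600
CONTEXT_TOKENS = 2_000_000  # the two-million-token context from the transcript

if __name__ == "__main__":
    # One hour at 256 tokens per frame: 3600 * 256 = 921,600 tokens.
    print(visual_tokens(HOUR_S, tokens_per_frame=256))
    # Six hours at 64 tokens per frame: 21,600 * 64 = 1,382,400 tokens.
    six_hours_64 = visual_tokens(6 * HOUR_S, tokens_per_frame=64)
    print(six_hours_64, six_hours_64 <= CONTEXT_TOKENS)
    # Six hours at 256 tokens per frame would not fit: 5,529,600 tokens.
    print(visual_tokens(6 * HOUR_S, tokens_per_frame=256))
    # A 10-second golf swing slowed 5x becomes 50 seconds, so 1 FPS sampling
    # sees 5 frames per second of the original motion, at 5x the token cost.
    print(effective_fps(1.0, 5.0), visual_tokens(10 * 5))
```

Under these assumptions, dropping from 256 to 64 tokens per frame is what makes six hours fit inside two million tokens, and slowing a clip down buys temporal resolution at a proportional cost in tokens.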
What we see today is that 64 tokens actually performs remarkably well, almost at the same quality level as 256. And back to your question earlier of why we can't just sample at higher frames per second: part of the reason is that the models were trained at the 1 FPS sampling rate. Got it. And what this results in is the model learning to line up audio and video at this sampling rate. But yeah, this is something that we're working on, and we have a bunch of cool things to share coming soon. Yeah, I'm excited for the higher FPS bit to land so that Gemini can tell me how horrible my golf swing is, because I don't golf enough. You and I have talked a bunch about this, and also as you transition to doing model personality stuff: what does the future for Gemini multimodal start to look like? Obviously we've launched a bunch of the native output modality capabilities now, with audio, with image. In some future version of the world we'll hopefully have video as well, which would be awesome, in a single model. It feels like maybe the frontier is on output modalities now and less on input modalities, but do you see a bunch of places still to hill climb from a quality or capability standpoint on the multimodal input side? Yeah, there's a ton. I think the first thing is that we want to get to a world where these models are amazing at multimodal in, multimodal out, so they can take in any modality and generate any modality. Some of the generation stuff is super exciting. On the vision side, there's still a ton to do. One of the things that I'm really excited about is bringing some of these capabilities together to form a more cohesive system. For example, Gemini is amazing at spatial understanding.
So, it can generate 2D bounding boxes, 3D bounding boxes, point coordinates, segmentation masks, and so on. This is really cool. Just for folks who haven't tried this before: if you're watching this video recording, you could take a screenshot of what's behind us and go to AI Studio right now, which also has native image generation and editing, and I think that benefits from the spatial understanding. You could say, move the couch and Ani against the wall, and it'll understand what that actually means. I just have to make the plug for how cool it is, because you say spatial understanding and people are like, "ah, that's [unclear]," but it's actually so cool. Yeah. I mean, it's super amazing. Models in the past were able to do detection. The cool thing about Gemini being able to do detection is that you have this reasoning backbone and world knowledge as well. Here's a very simple example: you ask Gemini to detect the person that's the furthest to the left in an image, and Gemini is able to do that because it's able to reason over the image, understand the relative positioning of an object, and then generate the bounding box. Something that I tried before: I took an image of the fridge in our micro kitchen and asked, which drink has the fewest calories? And it generated a bounding box around the bottle of water. So these things are super cool. We're still in the very early stages of these capabilities, so I think there's a lot more that we can do there. They still kind of feel like a toy. That being said, a lot of these niche capabilities have very specific groups of power users.
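For readers who want to work with such boxes downstream, here is a small sketch of the geometry behind the "furthest to the left" example. It assumes boxes arrive as (ymin, xmin, ymax, xmax) on a 0 to 1000 scale, which is how the public Gemini documentation has described 2D boxes; treat that format, and every name below, as an assumption to verify. In the transcript the model itself does the spatial reasoning; this code only shows the equivalent post-processing on a list of detections.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Detection:
    label: str
    # Assumed format: (ymin, xmin, ymax, xmax), each normalized to 0..1000.
    box_2d: tuple[int, int, int, int]


def to_pixels(box_2d: tuple[int, int, int, int],
              width: int, height: int) -> tuple[int, int, int, int]:
    """Convert a normalized box to pixel (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box_2d
    return (round(xmin * width / 1000), round(ymin * height / 1000),
            round(xmax * width / 1000), round(ymax * height / 1000))


def leftmost(detections: list[Detection], label: str) -> Optional[Detection]:
    """Return the detection with the given label whose left edge is smallest."""
    candidates = [d for d in detections if d.label == label]
    if not candidates:
        return None
    return min(candidates, key=lambda d: d.box_2d[1])


if __name__ == "__main__":
    detections = [
        Detection("person", (100, 600, 900, 800)),
        Detection("person", (100, 200, 500, 400)),
        Detection("couch", (500, 50, 950, 550)),
    ]
    person = leftmost(detections, "person")
    print(person)
    print(to_pixels(person.box_2d, width=1920, height=1080))
```

The difference the speaker highlights is that with Gemini the request can be phrased in natural language and can lean on world knowledge, such as which drink has the fewest calories, rather than on a fixed rule like the one above.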
So on the spatial stuff, we have folks building models for robotics that are using these models a ton. Yeah. Because spatial understanding is a core building block for embodied reasoning and perception for robots. So what I'm excited about is seeing some of these capabilities come together, both on the understanding side and on generation. I mean, one of the cool things that you get with spatial as well is some notion of thinking, right? If a model can point to objects in an image and generate bounding boxes for things in an image, that improves the model's ability to reason and think over visual data formats, and that's something that I also think is super cool. And yeah, we have tons of folks working on it. Yeah, I think vision still feels like, to opine on a point that you've already made, one of the areas where there's the biggest gap between the model capability and the products that people are building. It just feels like there are so many interesting things to be built and there's just not that much stuff being built. Which gets me kind of excited, again, for people who are in the position of going and building companies around this stuff. One of the modalities that we talk a lot about is document understanding; maybe not a modality, but a use case. And there's a ton of positive traction around Gemini and document understanding and OCR. Do you want to talk about that: what it takes for the models to be good at that, what that use case looks like, and why people are so excited about it? Yeah. So a ton of information is stored in documents, so it's very clear that documents are a powerful medium of information that Gemini should be really good at analyzing and reasoning over.
I think the reason why we see a ton of demand for documents as a vision use case is that there were existing vision models that were able to do OCR and translate and things like that. The cool thing that you get with Gemini, though, is that you get these capabilities, but you get them with the reasoning backbone that Gemini offers. And one of the things I'm really excited about with documents is being able to feed in a ton of documents as context to the model to perform a fairly complex multi-step task. That's something that existing models weren't able to do pre-Gemini. And given that so much information, so much personal information, information at companies and things like that, is stored in documents, that's a very powerful visual use case. The other reason why using Gemini for documents is interesting is that in the past, the way these workflows worked was that users would OCR a document and then feed that as text into AI systems, or OCR these things and then store information in that medium, right? So, search over that information for retrieval and things like that. I think the cool thing about using vision to understand documents is that now you have a really powerful system that can see a document just like a human can. And one of the cool things here is that documents are often not just plain text; they have interesting formats. They contain charts, images, diagrams, things like that. In the past, these were very hard to transcribe and use, even for use cases like search, but also for some of these more complex tasks. The cool thing about Gemini is that you can just feed all this in. Gemini reads all these documents like a human does and then is able to do a ton of cool things.
I mean, something that I was trying the other day was I fed in earnings reports from companies over the last 10 quarters with a million-token context, which is like tens of thousands of pages with, yeah, like 2 million tokens, and got it to do a bunch of analysis on these companies for me. The cool thing here is that, to be able to do this effectively, you need to be able to read very long, intricate tables in these documents, which previous OCR modules weren't as great at. So given that documents are such a massive store of information, it makes a ton of sense that people are using Gemini for this, and it's something that we care a lot about as well. It also feels uniquely Google, in the sense that I think the official Google mission is to organize the world's information and make it universally accessible. I think there's so much data like this; I was just thinking about myself. I have a drawer somewhere in my apartment with a bunch of hard copies of paper that I'm never going to look at again. And it's not universally accessible in that medium, even for the data that I have today. So, yes. Yeah, I mean, that's an amazing point. I think two things there. This is something that I'm super excited about on vision: I think it unlocks vision as a store of information and makes visual information so much more accessible and useful, to Google's mission. So with documents, we see this: Gemini is amazing at what we call layout-preserving transcription. It can transcribe a document and preserve layout, style, and structure. The other really cool way that vision makes information more accessible is video, actually.
So something that we see a ton of people do is take videos of things around them, feed these videos into Gemini, and then use that to catalog information. So yeah, a ton of people take videos of their bookshelves and libraries and then catalog that information. Yeah, I do this in AI Studio all the time, where I'll take long videos of whatever topic someone talked about in a podcast that I was a part of or listened to, or whatever it was, and have the model pull out interesting clips and stuff like that. It's super powerful, especially with content where I don't want to have to hear myself talk. You let the model take care of all the work. Exactly. It also makes tasks far more efficient. Something that I did when I was at our office in London: we have a very nice library there, and I took a video of all the books in this library and asked Gemini to catalog these books by genre and by author. And because Gemini has this world knowledge and reasoning backbone, but then also these visual capabilities, in a single model, I was able to do this super well. I did the same thing at our micro kitchens. Let's try this here after this. Yeah. One thing that I've been trying since the Gemini 1.0 days is cataloging snacks in our MKs. Yeah. And these models are now pretty much perfect at these use cases. And I think this goes to some of the questions around user interfaces as well. This is something that doesn't feel like a natural use case because it hasn't been possible before. People are very used to using vision for traditional OCR, translation, classification, these types of things. But what Gemini's multimodal capabilities give us is the ability to do so much more that we previously wouldn't have thought was possible.
And I think that'll take some time to play out as well. Ani, one of the consistent threads (maybe this is intentional, maybe it's not; I'd have to reflect on this) of a lot of the conversations we have with people through this podcast is just how much, on team Gemini, there's deep collaboration: all the different modalities getting better makes other people's modalities better. I'm curious, on the multimodal side, as someone who has a little bit less context on what the team looks like and what the structure is, how do you think about that? What's happening, who are the research people, all that stuff? I know there's a ton of people. Yeah, so there are a ton of people. Firstly, I'm just a spokesperson for a massive research team, and the Gemini multimodal team is the most amazing team. We've grown a ton since the Gemini 1.0 days, which also shows how strong these capabilities are getting. And I think the thing that's really awesome is that multimodal has so many of these capabilities, a ton of the things that we've spoken about, and what you need to make this happen, which is a very hard problem, is to bring these capabilities together into a single model and make sure that each capability performs super well. We have JB, who leads our multimodal team. He's a rock star. He's been working on vision from before Gemini, back in the Flamingo days. And we have workstream leads for image, video, spatial, and all these things. And I think what's been really remarkable is how all of this has come together into a single model that is super strong at these multimodal capabilities.
Yeah, it's very interesting to reflect on. I don't think that's the default outcome. Maybe it's just that we have great people who work really well together, but I think it's hard to make that collaboration happen, and it's so cool to see that, time and time again, it actually plays out and it works and we get great results from a model perspective from everyone coming together. So it's awesome to see. Yeah. And I think the other thing that's really awesome is, first, that the team thinks really deeply about how developers and consumers will use these vision capabilities, and we really try to build strong intuition for this and bring that into our model. So we have this very close product-model feedback loop. And the second is that we spend a lot of time thinking about and chatting with each other about how people will use these capabilities in the future. So if we extrapolate to these capabilities becoming much stronger and coming together in a cohesive way, what are the ways that people are going to interact with these models a year from now, two years from now, five years from now? And a lot of the capabilities that go in today are building blocks towards this vision that our team has, which I also think is super powerful. Yeah, it's been awesome to see all the progress on multimodal for the last year. It's been awesome to collaborate with you and JB and the multimodal team, so I'm super appreciative of all the hard work that y'all have done. Even outside the model stuff, you know, if folks have complaints about the multimodal API docs, go to Ani and he'll help make them better. But you're transitioning now to go start working on model behavior stuff.
So, we won't do a deep dive on model behavior, but just to plant the seed: I think this is definitely somewhat emergent right now, so what will you be thinking about next? Yeah. So, related to some of the things we've spoken about, something that I think is a very important problem is having these models feel natural to interact with. I go back to the world today, where we have these very turn-based systems. It feels kind of unnatural. It feels a bit dated. And something that I'm passionate about is building AI systems that feel likable, that you can interact naturally with. So, going into more detail on the model behavior stuff, how this translates is giving the model skills like empathy and being able to understand the user and understand implied intent, and giving the model a personality while striking the balance with all of these amazing raw capabilities that Gemini has. I think the other piece to some of this is that in a lot of the AI use cases today, these models just give you a ton of text. Something that I've been thinking a ton about is whether there are interesting visual formats that we could use to communicate information in a more information-dense or "high calorie" manner, which is the way that we like to think about these things. And I think it's a very critical problem for making Gemini a nice model to talk to and interact with. I'm excited for this. I think my seed to plant with you is that people really like the way that the NotebookLM audio overview personality is, and the way that it engages from a conversational standpoint is super relatable, and folks really like it.
So I think there's some interesting thread to pull on there. We'll have to catch up more about this sometime in the future and see if there are any interesting outcomes. To give you credit, for folks who have been watching: you and the multimodal team have been, I think, some of our strongest collaborators on the AI Studio and Gemini API side, and it's been a ton of fun to work with you and the team to do that. It feels like this very unique research-to-product acceleration story, which is that you all care about what the API looks like and what the capabilities are and all this stuff. So I'm appreciative of you and JB and everyone else for pushing so hard to make all that happen. Likewise. Yeah, thanks so much for bringing these capabilities to life through Studio. Thanks for taking the time to sit down and chat everything multimodal. And thanks, everyone, for listening, and we'll see you in the next episode. [Music]
Original Description
Ani Baddepudi, Gemini Model Behavior Product Lead, joins host Logan Kilpatrick for a deep dive into Gemini's multimodal capabilities. Their conversation explores why Gemini was built as a natively multimodal model from day one, the future of proactive AI assistants, and how we are moving towards a world where "everything is vision." Learn about the differences between video and image understanding and token representations, higher FPS video sampling, and more.
Chapters:
0:00 - Intro
1:12 - Why Gemini is natively multimodal
2:23 - The technology behind multimodal models
5:15 - Video understanding with Gemini 2.5
9:25 - Deciding what to build next
13:23 - Building new product experiences with multimodal AI
17:15 - The vision for proactive assistants
24:13 - Improving video usability with variable FPS and frame tokenization
27:35 - What’s next for Gemini’s multimodal development
31:47 - Deep dive on Gemini’s document understanding capabilities
37:56 - The teamwork and collaboration behind Gemini
40:56 - What’s next with model behavior
Watch more Release Notes → https://goo.gle/4njokfg
Subscribe to Google for Developers → https://goo.gle/developers
Speaker: Logan Kilpatrick, Anirudh Baddepudi
Products Mentioned: Google AI, Gemini