What is Google's Gemini 1.5 Pro | 10 Million Token Window

Alejandro AO · Beginner · 🧠 Large Language Models · 2y ago

Key Takeaways

Google's Gemini 1.5 Pro is a multimodal language model built on a mixture-of-experts architecture. Its context window is 128,000 tokens in the standard version and up to 1 million tokens in production, and Google reports testing up to 10 million tokens in research.

Full Transcript

Good morning everyone, how is it going today? Welcome back to the channel, where we talk about software engineering and AI, and about how to implement new AI technologies, libraries, and models in your applications. In today's video we're going to cover some pretty groundbreaking news: the release of Gemini 1.5 by Google, a model that is pushing the boundaries of what we thought possible in AI. We're going to go over the official release post, summarize it, and explain what each part of it actually means. We're also going to delve a little into the more technical report and explain how this new model compares to other models on the market. So without further ado, let's get right into it. [Music]

According to Google, this new language model is a groundbreaking leap into the future of multimodal machine learning. It delivers dramatically enhanced performance, and we're going to see how that works. It does that with a multimodal model, and I'm going to tell you a bit more about what that means. And it does that using the mixture-of-experts architecture, which I'm going to tell you about as well.

The most impressive thing, and the one making the most buzz on the internet, is its ridiculously large context window. The standard version has 128,000 tokens, in production it can go up to 1 million tokens, and in the official report they say they have even tested up to 10 million tokens, which is ridiculous. I'm going to tell you more about that and show you how impressive it is.

But first let me explain what multimodal means. A multimodal model can take several kinds of formats as input, not only text. To show this, let's imagine that you have a text-only language model and you
want to feed it a video, an image, or an audio file. What you would have to do is first extract the text from that video or audio file, and then feed the text to the language model. What makes a multimodal language model different is that it can natively take the video or audio files as input and tokenize them, without having to rely on other services such as the Whisper API to convert the audio into text. That is pretty impressive, and it comes by default with Gemini 1.5.

Another interesting part of the model is its architecture. Google says openly that the architecture they are using is the mixture-of-experts architecture, which is described in this paper right here. I'm not going to guide you through the entire technical paper, but just so you know, mixture of experts is the same architecture that Mistral, and probably GPT-4, are using, which means that the whole language model is actually several models working together under the hood.

A mixture-of-experts design is a machine learning strategy where multiple specialized models, the so-called experts, are trained to perform very well at different, very specific tasks. For example, you will have one model that is very good at mathematical reasoning, another that is very good at verbal communication, copywriting, and narration, another that is very good at debugging code, another that is very good at software design, and so on. What the mixture-of-experts architecture does is take all of these models that are very good at their specific task and put them under a single hood. Whenever the user sends a query, the model chooses which expert to use for that specific task. So if the user is talking about mathematics, it is going to use the expert in mathematical reasoning; if the user sends a query about code and debugging, it's going to use an
expert in debugging. What you get in the end with this architecture is a system that is very good at handling very different tasks by combining the strengths of different experts, which leads to much better performance and flexibility in your model as well. So that's the architecture they are using.

Something else we have to talk about is the ridiculously long context window. As we mentioned before, with the 1 million token version the context window can take up to 1 hour of video, 11 hours of audio, or 700,000 words. That is just ridiculous. To put it in perspective, 700,000 words is pretty much all of the Harry Potter series minus one book, I think, all sent at once in a single prompt to your language model. Just think about how revolutionary that is, if it actually works as they say it does. And that is only with the 1 million token window that is available in their API. According to their report, they have tested windows of up to 10 million tokens in their research with pretty good results. We're going to take a look at that in a moment, but yeah, pretty impressive.

Now let's take a look at the performance for a moment, and apparently it is very, very impressive. What we have to look at is the official report that they published alongside the blog post, and here we have some results. The results show the plots for the needle-in-a-haystack test. I'm going to explain real quick what needle in a haystack is, starting with the text haystack. Let's go to page nine first of all, which is where they make the comparison with GPT-4, and here we have it.

So let's see how needle in a haystack works. It is a test to evaluate a language model's ability to find specific information, which would be the needle, in a very large context, which would be the haystack. In this case, what they do is take this piece of information right here, this short sentence
which says "the special magic city number is" followed by a given number. They hide this information somewhere in the prompt, and then at the end they ask which is the magic number, and they evaluate whether the model returned it correctly or not.

Here is how the plot works. On the vertical axis you have the depth, which means where in the context the hidden sentence is positioned: 0% means at the beginning of the context, 100% means at the end. The horizontal axis is basically just the context window, so here you have 32,000 tokens up to 10 million tokens. Consider this point right here, for example. It means that they sent a 128,000-token prompt, which is somewhere around 90,000 words, and they hid this piece of information, "the special magic city number is", let's say, seven, at 43% of the prompt, so somewhere around the middle. Then they asked the question "what is the magic number?", and apparently it responded correctly, because we have a green square. So that's how this works.

You can see the comparison between Gemini 1.5 Pro and GPT-4 Turbo. As you can see, the context window for Gemini 1.5 Pro is ridiculously large: here you have up to 1 million tokens, while GPT-4 Turbo goes only up to 128,000 tokens. GPT-4 Turbo has 100% recall within its window, and apparently Gemini 1.5 Pro has a staggering 100% recall up to 530,000 tokens, which is just ridiculous. Then they say that it has almost 100% up to 1 million tokens, which means that in the tests they ran, the language model was capable of finding hidden information inside a huge context of 1 million tokens 99.7% of the time. Ridiculous.

The second performance test I wanted to show you is the one they do with video. For this one they use GPT-4 Vision as a benchmark, and, well, according to them they do ridiculously well. What they did here is a needle-in-a-haystack test, just as with the
text. However, this one is a little different because it is multimodal, in the sense that the question they ask at the end is in text, of course, but they are asking the model to retrieve information from a video. That in itself is already quite impressive. The way they do it is that they overlay the text "the secret word is" whatever on a single randomly sampled video frame. So, for example, they have a video and they put this text in the first frame, or in the frame at the 20th second, etc. At the end they ask the language model "what is the secret word?", and the language model has to find whatever secret word they overlaid on that specific frame.

Just as with the text, the x-axis is the context window, in this case in minutes because we're measuring a video, and the depth is the place in the video where the overlay was. So we go from 0%, which means at the beginning of the video, all the way to 100%, which means at the end of the video. As you can see, GPT-4 Vision was pretty good, with 100% recall across its whole context window, which apparently goes up to about 4 or 5 minutes. Gemini 1.5 Pro gets 100% as well, but up to 3 hours. Well, 3 hours is what they test in research, and up to 1 hour is what they have in production. So that in itself is quite impressive.

Last but not least, we have the audio haystack test, which is pretty much the same thing as the one we saw with video. This is also a multimodal test. What they do is hide a very short audio clip, just a few words long, where the speaker says "the secret keyword is" whatever, within a very long audio signal. They use the VoxPopuli dataset, with multiple speakers, to make it harder for the language model, and the input audio ranges from 12 minutes to 22 hours, or 2 million tokens. They insert this needle, which is this phrase
right here, at different positions across the signal, and they test whether or not the language model is able to find the secret word. In this case, apparently, Gemini 1.5 Pro was pretty much perfect all the way to 22 hours. That means they gave it a 22-hour audio file with this needle secretly placed somewhere in the file, and it was always able to find the information correctly. Pretty impressive. They test this against GPT-4 Turbo plus Whisper, because of course GPT-4 Turbo does not take audio as input natively. So what they do is take this same dataset and pass it through Whisper, which is the API by OpenAI that allows you to convert audio to text, then pass whatever text they got from Whisper to GPT-4 Turbo and ask the final question for the secret keyword. As you can see, GPT-4 Turbo had much lower recall than Gemini 1.5 Pro. So that's it for the audio performance.

Then, at the end, we have this slightly more sophisticated test, which tests multiple needles in the haystack. This basically means that they put several needles throughout the corpus being tested and then evaluate how many of those needles were actually retrieved, which is much more realistic. As you can see, in this test Gemini 1.5 Pro also seems to be doing way better than GPT-4 Turbo.

So those are the performance results for Gemini 1.5 Pro published by Google, and they do look very, very groundbreaking. Let me know what you think about this in the comments, but I can't wait to use this model to build some applications.

Finally, let's talk about this pretty amazing feat that they pulled off with Gemini 1.5 Pro. Apparently they were able to teach Gemini 1.5 Pro a completely new language that it had never been exposed to, all within a single context window, which is just amazing if you think about it. What they did is take a 500-page grammar book and dictionary of a very rare language called Kalamang
that is spoken by fewer than 200 speakers in West Papua, and they sent that document as context to the language model, all within a single prompt. Then they asked the language model to make some translations from English to Kalamang, and they found that the quality of these translations was comparable to that of a person who had learned from the same materials. To me that is just impressive. And the fact that this is made possible by this insanely large context window means that the next generation of language model applications is probably going to be very different from the ones we build today. So that is pretty much how the model performs and the amazing feats it is able to do.

That wraps up our dive into Gemini 1.5. Let me know what you think in the comments; I'm very excited to see what you have to say. And be sure to stay tuned here on the channel, because very soon we are going to be building very nice applications with Gemini 1.5 using Python, so be sure not to miss that. Thank you very much for watching, don't forget to subscribe, and I will see you next time. [Music]
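
The single-needle test described in the transcript can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the harness Google used: `build_haystack`, `run_grid`, and `toy_model` are hypothetical names, context length is counted in filler words instead of tokens, and `toy_model` is a regular-expression stand-in for a real LLM call.

```python
import re
from typing import Callable

NEEDLE_TEMPLATE = "The special magic city number is: {number}."
QUESTION = "What is the special magic city number?"


def build_haystack(filler_words: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    position = round(depth * len(filler_words))
    return " ".join(filler_words[:position] + [needle] + filler_words[position:])


def run_grid(
    ask_model: Callable[[str], str],
    context_lengths: list[int],
    depths: list[float],
    number: int = 7,
) -> dict[tuple[int, float], bool]:
    """Return one pass/fail cell per (context length, depth) pair, like the plot."""
    needle = NEEDLE_TEMPLATE.format(number=number)
    results = {}
    for length in context_lengths:
        filler = ["lorem"] * length  # assumption: length is in words, not tokens
        for depth in depths:
            prompt = build_haystack(filler, needle, depth) + "\n\n" + QUESTION
            results[(length, depth)] = str(number) in ask_model(prompt)
    return results


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: extracts the number with a regular expression."""
    match = re.search(r"magic city number is: (\d+)", prompt)
    return match.group(1) if match else "unknown"


grid = run_grid(toy_model, context_lengths=[1_000, 10_000], depths=[0.0, 0.43, 1.0])
recall = sum(grid.values()) / len(grid)
print(f"recall: {recall:.1%}")
```

Each `True` cell corresponds to a green square in the plot. To evaluate an actual model, `toy_model` would be replaced by a function that sends the prompt to that model and returns its text reply.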

Original Description

In this video, we cover what is Google's Gemini 1.5 Pro, its context window, and what sets it apart as the AI breakthrough of the year. 🚀 Dive into the latest breakthrough in artificial intelligence with us as we explore Google Gemini v1.5, a model that's redefining the limits of AI technologies. With its advanced Mixture-of-Experts architecture and unparalleled long-context understanding, Gemini v1.5 is not just an upgrade; it's a leap into the future of multimodal machine learning.

LINKS
📌 Official blog post: https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/
💬 Join the Discord Help Server: https://link.alejandro-ao.com/HrFKZn
❤️ Buy me a coffee... or a beer (thanks): https://link.alejandro-ao.com/l83gNq
✉️ Get the Newsletter: https://link.alejandro-ao.com/AIIguB

💡 What We Cover in This Video:
Introduction to Gemini v1.5: Get to know the cutting-edge AI model, Gemini v1.5, and its unique capabilities.
Multimodal Inputs Explained: Discover how Gemini v1.5 processes vast amounts of text, video, and audio data, setting a new standard for AI models.
Mixture-of-Experts Architecture: Understand the innovative architecture behind Gemini v1.5's efficiency and adaptability.
Long-Context Window Performance: See how Gemini v1.5 excels in processing and understanding extended contexts, outperforming other models like OpenAI's GPT-4 and Whisper.
'Needle-in-the-Haystack' Test Results: Witness the model's remarkable ability to extract precise information from vast datasets.
Implications for AI's Future: Discuss the transformative potential of Gemini v1.5 across various industries.

🌐 Join us as we delve into Gemini v1.5's capabilities, from its rapid learning ability to its exceptional performance in long-document QA, long-video QA, and long-context ASR. We'll also compare it to other giants in the field like OpenAI Sora and explore its potential in video analysis and beyond.

✨ Why Gemini v1.5 M

Google's Gemini 1.5 Pro is a multimodal language model built on a mixture-of-experts architecture. Its context window reaches 1 million tokens in production and up to 10 million tokens in Google's research tests, which lets it process large amounts of text, video, and audio, and even pick up a new language from material supplied within a single context window. This video covers the key features of Gemini 1.5 Pro, including its performance in the text, video, audio, and multiple-needle haystack tests.
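
The multiple-needle test mentioned here scores how many hidden facts come back, so the result is a fraction and not a single pass or fail. A minimal sketch of that scoring follows, assuming made-up needle sentences and a hard-coded model reply; `hide_needles` and `multi_needle_recall` are hypothetical helper names, not functions from any library or from Google's report.

```python
import random
import re


def hide_needles(filler: list[str], needles: dict[str, int], seed: int = 0) -> str:
    """Scatter one sentence per needle at random positions among the filler sentences."""
    rng = random.Random(seed)
    sentences = list(filler)
    for name, value in needles.items():
        position = rng.randint(0, len(sentences))  # randint is inclusive on both ends
        sentences.insert(position, f"The secret number for {name} is {value}.")
    return " ".join(sentences)


def multi_needle_recall(needles: dict[str, int], model_answer: str) -> float:
    """Fraction of needle values that appear as whole numbers in the answer."""
    numbers_in_answer = set(re.findall(r"\d+", model_answer))
    found = sum(1 for value in needles.values() if str(value) in numbers_in_answer)
    return found / len(needles)


needles = {"alpha": 1111, "beta": 2222, "gamma": 3333, "delta": 4444}
haystack = hide_needles(["The grass is green."] * 1_000, needles)

# Hypothetical model reply that misses one of the four needles.
answer = "alpha is 1111, beta is 2222 and gamma is 3333."
print(multi_needle_recall(needles, answer))  # 0.75
```

In a real run, `haystack` plus a question listing the needle names would be sent to the model, and `answer` would be its reply.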

Key Takeaways
  1. Understand the architecture of Gemini 1.5 Pro
  2. Learn about the large context window and its benefits
  3. Explore the performance of Gemini 1.5 Pro in the text, video, audio, and multiple-needle haystack tests
  4. Note that building applications with Gemini 1.5 Pro in Python is announced for upcoming videos, not covered in this one
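
For the architecture takeaway, the routing idea behind a mixture of experts can be sketched as a toy top-k gate. This is a generic illustration, not Gemini 1.5's actual design, which Google has not published in detail. The experts here are plain arithmetic functions and the gate scores are supplied by hand, both assumptions made for readability. In published sparse mixture-of-experts models the experts are typically sub-networks inside each layer and the gate is learned and applied per token, so experts do not map neatly onto topics such as "math" or "code".

```python
import math
from typing import Callable

# Toy experts: stand-ins for expert sub-networks (assumption for illustration).
EXPERTS: list[Callable[[float], float]] = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x ** 2,
    lambda x: -x,
]


def softmax(scores: list[float]) -> list[float]:
    """Turn raw gate scores into probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_scores: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalise their weights to sum to 1."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x: float, gate_scores: list[float], k: int = 2) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(weight * EXPERTS[i](x) for i, weight in route(gate_scores, k))


# Experts 1 and 2 tie for the highest score, so each gets weight 0.5:
# 0.5 * (3 * 2) + 0.5 * (3 ** 2) = 7.5
print(moe_forward(3.0, gate_scores=[0.0, 5.0, 5.0, -3.0]))
```

Only the selected experts are evaluated for a given input, which is why this kind of design can grow total capacity without running every expert on every input.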
