Getting Started with Google Gemini 2.5 Pro: Detect Objects, Generate Captions & OCR

Muhammad Moin · Beginner · Computer Vision · 1y ago


Key Takeaways

This video teaches how to use Google Gemini 2.5 Pro for object detection, image captioning, and optical character recognition (OCR).

Full Transcript

Hello everyone. In this video tutorial, we will explore how to use Google Gemini 2.5 for different vision tasks, including object detection, image captioning, and optical character recognition. So let's start with what Google Gemini 2.5 is. Google Gemini 2.5 is a vision language model released in Pro and Flash versions.

You might be wondering what a vision language model is. A vision language model consists of two parts: the vision part and the language part. On the vision side, it can look at, understand, and describe an image. On the language side, it can read text and generate text.

Both Gemini 2.5 versions are multimodal, meaning they accept multiple types of input: text, images, audio, and video. Both the Pro and Flash models can process up to 1 million tokens of context.

As I told you, Gemini 2.5 is released in Pro and Flash versions, and Gemini 2.5 Pro outperforms the Flash version. Gemini 2.5 Pro is designed for maximum capability and delivers strong results on tasks such as code generation, long-context reasoning, document analysis, and multimedia understanding. Gemini 2.5 Flash provides a balance of quality and efficiency, with lower compute and latency requirements.
Over here, the performance of the Gemini 2.5 Pro and Flash models is compared with other open-source and closed-source models. You can see that Gemini 2.5 Pro outperforms Gemini 2.5 Flash, and that both models outperform many open-source and closed-source models.

Okay, so let's get started. I will just delete the previous runtime and restart it. In step number one, we will install the Google Generative AI package, because we want to use the Gemini 2.5 Pro model and it is available through that package. After installing the package, we will import all the required libraries. We will be using OpenCV Python, which is imported here as cv2, because we need it to draw bounding boxes around the detected objects and to add the label, or class name, above each detected object. We will also use the NumPy library. You might be wondering why we did not install the OpenCV Python and NumPy packages: they come pre-installed in Google Colab. To display an input or output image in the Google Colab notebook, we also need the Image library.

First of all, we will initialize the Gemini client with the API key. To get your Google Gemini API key, just search for "Google Gemini API key", go to this link, and click on "Create API key". Select an existing project, create the API key, and copy it from there. So now we have initialized the Gemini client with the API key. I will be using Gemini 2.5 Pro in this tutorial because, of the two variants, Pro outperforms Flash.
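The setup steps above might look like the following in a Colab cell. This is a minimal sketch rather than the notebook's actual code: it assumes the `google-genai` SDK (installed with `pip install google-genai`), and the environment variable name `GEMINI_API_KEY` and the helper names `get_api_key` and `make_client` are illustrative.

```python
import os


def get_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Read the API key from an environment variable instead of hard-coding it."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to your Gemini API key first.")
    return key


def make_client(api_key: str):
    """Create a Gemini client (assumes the google-genai SDK is installed)."""
    from google import genai  # imported lazily so this cell loads without the SDK

    return genai.Client(api_key=api_key)


# Model used in this tutorial; "gemini-2.5-flash" would select the Flash variant.
MODEL_NAME = "gemini-2.5-pro"
```

Reading the key from the environment keeps it out of a notebook that may later be shared; `client = make_client(get_api_key())` would then create the client.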
Okay. So now we will create an inference function. The input arguments to this function are the image, the prompt, and the temperature, which we have set to 0.5. The temperature value defines how much creativity you want in your output; it controls creativity versus determinism. In the image argument, the user passes an input image, and in the prompt argument, the user passes a text prompt. We have already initialized the Gemini client with the API key, so inside the function we access the Gemini 2.5 Pro model. In the contents, I pass the text prompt and the input image, and we set the temperature value to 0.5. The output of this function is the text response generated by the Gemini 2.5 Pro model.

We now download some example images that we will use for testing. I am downloading these sample images from Drive directly into this Google Colab notebook. You can see the sample images over here: this is the baggage claim image, this is a bus image, and this is an image where different players are playing soccer.

Next, we read the image using OpenCV Python, which returns the image, and I use the Image library, which we imported from PIL, to display any input or output image. If I want to display the soccer image, I click on it here, choose "Copy path", and add that path over here. If you run this function, it will display the soccer image in this Google Colab notebook.
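The inference function described above can be sketched as follows. This is an assumption-based sketch, not the notebook's code: it expects `client` to be a `genai.Client` from the `google-genai` SDK, whose `generate_content` call accepts a list of contents (text and PIL images) and a config that can be given as a plain dict.

```python
def inference(client, image, prompt, temperature=0.5, model="gemini-2.5-pro"):
    """Send a text prompt plus an image to the model and return the text reply.

    `client` is assumed to be a google-genai `genai.Client`; `image` can be a
    PIL image, which that SDK accepts directly inside `contents`.
    """
    response = client.models.generate_content(
        model=model,
        contents=[prompt, image],
        config={"temperature": temperature},  # lower values are more deterministic
    )
    return response.text
```

With a real client, a call might look like `inference(client, Image.open("soccer.jpg"), "Describe this image.")`, where the file name is hypothetical.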
As I told you, we will be running different vision tasks using Gemini 2.5 Pro, including object detection, image captioning, and optical character recognition. So let's first perform object detection. Gemini models support object detection, helping you identify and recognize multiple objects within an image.

Here you can see the example prompt that I have provided; you can use this prompt as well: "Detect the 2D bounding boxes of objects in image. Detect the objects as accurately as possible. Review the results before generating the response. Make sure to generate correct bounding box coordinates." This is my input prompt, and here is the output prompt: "Return just bounding box coordinates and label, no additional text." I combine the input prompt and the output prompt when I pass them into the inference function. Here we read the image using OpenCV Python by calling the read image function. The results will contain the output, which is the bounding box coordinates and the labels.

So let's run this. It will take a few seconds, so let's wait until we get the response. We have a response over here, but sorry, I am just going to rerun this again, so it will take a few more seconds. Okay, so we have the results over here.

Now we will clean the data by removing the extra text from the start, and let's see what our cleaned data looks like. Here we have created a function, clean results, so that we can clean the results for visualization. And here I have defined a function, draw bounding boxes.
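Models often wrap a JSON answer in a Markdown code fence, which is presumably the extra text being removed here. The following is a minimal sketch of combining the two prompts and cleaning the reply. It assumes the reply is a JSON list; the `box_2d` and `label` key names in the test data are an assumption, and `build_prompt` is a hypothetical helper name.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def build_prompt(input_prompt: str, output_prompt: str) -> str:
    """Combine the task instructions with the output-format instructions."""
    return f"{input_prompt.strip()} {output_prompt.strip()}"


def clean_results(raw: str):
    """Strip an optional Markdown fence around the reply and parse it as JSON."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line (the marker plus an optional "json" tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
    text = text.rstrip()
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return json.loads(text.strip())
```

`json.loads` raises `json.JSONDecodeError` when the model adds prose around the JSON, which is a useful signal that the run should be repeated.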
Over here, you can see I am using the OpenCV Python rectangle function and putText function. With the rectangle function I draw bounding boxes, or rectangles, around each detected object, and with the putText function I add the text above each bounding box. Okay, so let's run this.

Now you can see that Gemini 2.5 Pro generates very good results in object detection. We are able to detect the persons in this image as well as the sports ball. The results look fantastic. This person was very blurry, but the Gemini 2.5 Pro model was able to detect this person as well. We are able to detect all six persons in this image, and the sports ball is also detected.

Now let's perform object detection on another input image. In this image we have a bus and four persons. I will define the text prompt: "Detect the 2D bounding boxes of objects in image. Detect the objects as accurately as possible. Review the results before generating the response. Make sure to generate correct bounding box coordinates." And here we have the output prompt: "Return just bounding box coordinates and labels. No additional text." You can see we are passing our input image from here; you can just copy the path and pass this input image. Let's see whether we are able to detect all four persons and the bus in this image.

After we have the bounding box coordinates, we clean the results for visualization. Then I call the draw boxes function again with the read image. In this function we are using OpenCV Python to draw bounding boxes around each of the detected objects.
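A minimal sketch of the drawing step follows. It assumes each detection is a dict with a `label` and a `box_2d` given as `[ymin, xmin, ymax, xmax]` normalized to a 0 to 1000 range, which is the convention Gemini's documentation describes for bounding boxes. The notebook's exact format is not shown in the video, so treat the key names, the scaling, and the function names as assumptions. `cv2.rectangle` and `cv2.putText` are the OpenCV calls mentioned in the transcript.

```python
def scale_box(box_2d, width, height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to pixel (x1, y1, x2, y2)."""
    ymin, xmin, ymax, xmax = box_2d
    x1 = round(xmin * width / 1000)
    y1 = round(ymin * height / 1000)
    x2 = round(xmax * width / 1000)
    y2 = round(ymax * height / 1000)
    return x1, y1, x2, y2


def draw_boxes(image, detections, color=(0, 255, 0)):
    """Draw a rectangle and a label for every detection on a copy of a BGR image."""
    import cv2  # imported lazily; pre-installed in Google Colab

    output = image.copy()
    height, width = output.shape[:2]
    for det in detections:
        x1, y1, x2, y2 = scale_box(det["box_2d"], width, height)
        cv2.rectangle(output, (x1, y1), (x2, y2), color, 2)
        # Place the label just above the box, but keep it inside the image.
        cv2.putText(output, det["label"], (x1, max(y1 - 8, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return output
```

OpenCV images are in BGR order, so for display with PIL in Colab the result would typically be converted first with `cv2.cvtColor(result, cv2.COLOR_BGR2RGB)`.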
We are also adding a label above the bounding box using the putText function in OpenCV Python. That looks quite promising, and here you can see that we are scaling the bounding box coordinates. Okay, so here we have the results. You can see that we are able to detect the bus, and we are also able to detect the four persons: one, two, three, four persons, and the bus. The results look quite amazing; Gemini 2.5 Pro performs very well in terms of object detection.

Let's see how Gemini 2.5 Pro performs in terms of image captioning. We have this input image where different persons are claiming their baggage. Here is the prompt: "Look at the image and create a detailed caption in one or two sentences that clearly describes what you see. Mention the main objects, their actions, and the overall scene. Be clear and accurate, and do not guess anything that isn't visible." So let's see whether we can use Gemini 2.5 Pro for image captioning. This will take a few seconds; you can see that it is currently in progress. After image captioning, we will see how we can use Gemini 2.5 Pro for optical character recognition.

Here you can see we have the image caption: "A diverse group of travelers wait at an airport baggage claim for their luggage. One man in a light blue shirt bends down to retrieve a black suitcase from the moving belt while other passengers watch and wait." The man in the light blue shirt is claiming the baggage, so that's correct.

Now let's see how we can use Gemini 2.5 Pro for optical character recognition, where we extract the text and the bounding box coordinates as well. The prompt begins: "Extract all the visible text from the provided image as accurately as possible."
The prompt continues: "Carefully review the extracted results before generating the response to ensure no text is missed or misinterpreted. For each text element, include correct bounding box coordinates. The bounding box must precisely match the location of the text in the image. Do not assume any text that is not clearly visible." And here we have the output prompt: "Return just bounding box coordinates, which will be the location of detected text areas, plus label."

We have this handwritten image that I am passing as the input, and my aim is to extract its text. So let's run this and see what results we get. This will take a few seconds before it generates a response for us, and then we will clean the results.

Okay, let's run this one more time; I think I am not getting what I am expecting. You can see that in this case we are not getting a good response. The bounding box coordinates are not correct, and I am not sure what the issue is. I might need to update the prompt to get accurate bounding box coordinates. I will update the prompt and show you the updated results.

So I have just rerun this again. I have not changed the prompt, and now you can see that it has generated good results: it is able to extract the text and provide the correct bounding box coordinates as well. The extracted text is: "This is a handwritten example. Write as good as you can."
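Because the same prompt produced unusable boxes on some runs and good ones on others, it can help to validate each response and retry automatically instead of rerunning the cell by hand. The following is only a sketch of that idea and is not shown in the video: `is_valid_detection`, `first_valid_result`, and the `box_2d` and `label` keys on a 0 to 1000 scale are all assumptions.

```python
def is_valid_detection(det) -> bool:
    """Check that one entry has a label and a sane [ymin, xmin, ymax, xmax] box."""
    if not isinstance(det, dict) or not isinstance(det.get("label"), str):
        return False
    box = det.get("box_2d")
    if not isinstance(box, (list, tuple)) or len(box) != 4:
        return False
    if not all(isinstance(v, (int, float)) and 0 <= v <= 1000 for v in box):
        return False
    ymin, xmin, ymax, xmax = box
    return ymin < ymax and xmin < xmax


def first_valid_result(generate, max_attempts=3):
    """Call `generate()` until it returns a non-empty list of valid detections.

    `generate` is any zero-argument callable that returns parsed detections,
    for example a wrapper around the inference and cleaning steps.
    """
    for _ in range(max_attempts):
        try:
            detections = generate()
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if (isinstance(detections, list) and detections
                and all(is_valid_detection(d) for d in detections)):
            return detections
    return None
```

This only catches malformed output. A box that is well-formed but drawn in the wrong place still needs a visual check.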
So we can see that Gemini 2.5 Pro gives some amazing results when it comes to optical character recognition as well. We are able to extract the handwritten text from the image, and we are also able to extract the correct bounding box coordinates.

So that's all from this tutorial. In this tutorial, we have explored the Google Gemini 2.5 Pro model and seen how we can use it for object detection, image captioning, and optical character recognition tasks. Thank you for watching.

Original Description

In this video tutorial, we explore how to use Google Gemini 2.5 Pro for Object Detection, Image Captioning, and Optical Character Recognition (OCR). Gemini 2.5 is Google’s advanced vision-language model, available in two versions: Pro and Flash. Both variants are natively multimodal, supporting text, image, audio, and video inputs, and can process up to one million tokens of context. Gemini 2.5 Pro is designed for maximum performance, delivering strong results across tasks such as code generation, long-context reasoning, document analysis, and multimedia understanding. On the other hand, Gemini 2.5 Flash is optimized for efficiency, offering lower compute and latency requirements while maintaining high-quality output. The model sets new benchmarks for performance and scalability, achieving 74.2% on LiveCodeBench (coding), 88% on AIME 2025 (math), and 82% on MMMU (image understanding).

Code: https://github.com/MuhammadMoinFaisal/Gemini-2.5-Pro-Object-Detection-Image-Captioning-OCR/blob/main/How_to_use_google_gemini_models_for_object_detection_image_captioning_and_ocr_.ipynb

🧑🏻‍💻 My AI and Computer Vision Courses ⭐

📗 YOLO26 Bootcamp: Real-Time Detection, Segmentation & Pose (13$)
https://www.udemy.com/course/yolo26-bootcamp-real-time-detection-segmentation-pose/?couponCode=PROMOTION10USD

📘 Hands-On RAG Bootcamp: Build Apps with LangGraph & LangChain (13$)
https://www.udemy.com/course/hands-on-rag-bootcamp-build-apps-with-langgraph-langchain/?couponCode=PROMOTION13USD

📙 Complete Computer Vision Bootcamp: YOLO to Multimodal AI (13$)
https://www.udemy.com/course/complete-computer-vision-bootcamp-yolo-to-multimodal-ai/?couponCode=PROMOTION13USD

📚 Generative AI, LLM Apps & AI Agents Masterclass 2025 (13$)
https://www.udemy.com/course/ai-agents-with-n8n-automate-anything-with-no-code/?couponCode=PROMOTION13USD

📘 YOLOv12 & YOLO26: Custom Object Detection & Web Apps 2026 (13$)
https://www.udemy.com/course/yolov12-custom-object-detection-tracking-webapps
