Image Caption Generator: Google Colab and Hugging Face

AI Anytime · Beginner · 🔧 Backend Engineering · 3y ago

Skills: LLM Foundations 80% · CV Basics 60% · ML Pipelines 50%

Key Takeaways

This video demonstrates the use of a pre-trained image caption generator model in Google Colab using Hugging Face and Python, with plans to build an API using FastAPI in a future video.

Full Transcript

Hello everyone, welcome to the AI Anytime channel. In today's video we'll see how to build an image caption generator app. We'll use a deep learning model, basically a transformer model, to perform this task: the input will be an image, and as output we'll get a caption for that image. I have a cast ready here, and I'll explain how we are going to build this image caption generator app. In the beginning we'll use a deep learning model to perform the task, and later on we'll build the application around that deep learning model. For the application we'll use Streamlit, and we'll also take this deep learning model and build an API using FastAPI, so that we can use this AI as a service. We call that AI as an API.

So how is it going to work? We are going to use an encoder-decoder model for this task. As I said, my input will be an image, we'll perform some operations on that image, and the output should be a caption. There are mainly two ways of achieving this with deep learning. Either you take a dataset of images and their respective captions and train a machine learning or deep learning model to generate the caption. That's one way of doing it, and for that you need a dataset to train on. There are multiple datasets you can use, Flickr8k or Flickr30k for example. The other way is to use a pre-trained model: you just use its weights to make predictions, basically running inference on top of the pre-trained model. We can use a couple of pre-trained models from Hugging Face to perform the task. In this video we'll go with a pre-trained model, and I'll show how we can use an existing model to achieve the task. As you can see, you
have an input image, and first I'll pass it through an encoder model to get the encodings. Just to give you a brief: for this kind of task both natural language processing and computer vision are involved, so NLP and CV both. When you have an image, you first pass it through an encoder model. That can be ViT or Swin, for example; there are many encoder models that will give you the encodings for your image. That can be one image or multiple images at the same time, and you get the respective encoding for each image. Let me bring my monitor up. When you have that encoding, you pass it through an autoregressive, or decoder, model (let me write it here), which is basically a language model, and then you get the caption. If we talk about the autoregressive or decoder model, there are many language models. For example, we have RoBERTa, a very famous language model, and then you have GPT-2. So RoBERTa, GPT-2, etc.: different decoder models can be combined with the encoder model to perform this task. Basically it's a fusion of an encoder and a decoder model. This is a very high-level flow: you pass an input image and in return you get the output caption. This is what we are going to use.

Let me bring up my Colab; we'll be using Google Colab today. Let me connect my monitor with this. We'll see how we can use Transformers to perform this task. Let me change the runtime to get a GPU accelerator. You can use a GPU, but it works fine with a CPU as well; it might just take a little more time to do the inference. I'll connect and install Transformers with pip install. Let me do one thing: let me increase the font size a bit. I'll make it 18.
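To make the high-level flow above concrete (image, encoder, encoding, autoregressive decoder, caption), here is a toy sketch in plain Python. It is not ViT or GPT-2: the "encoder" just averages pixel brightness, and the "decoder" is a hand-written lookup table, both invented for illustration. What it does show is the loop itself: the decoder picks one token at a time, conditioned on the encoding and on the tokens produced so far, until it emits an end token or reaches a length limit.

```python
from typing import Dict, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical start and end markers


def encode(pixels: Sequence[float]) -> float:
    """Stand-in encoder: summarize a flat list of pixels as mean brightness."""
    return sum(pixels) / len(pixels)


def next_token_scores(encoding: float, prefix: List[str]) -> Dict[str, float]:
    """Stand-in decoder step: score candidate next tokens.

    A real decoder would use the encoding and the whole prefix; this toy
    only looks at the encoding and the most recent token.
    """
    bright = encoding > 0.5
    table = {
        BOS: {"a": 1.0},
        "a": {"sunny": 0.9 if bright else 0.1, "dark": 0.1 if bright else 0.9},
        "sunny": {"beach": 1.0},
        "dark": {"room": 1.0},
        "beach": {EOS: 1.0},
        "room": {EOS: 1.0},
    }
    return table[prefix[-1]]


def generate_caption(pixels: Sequence[float], max_length: int = 10) -> str:
    """Greedy autoregressive decoding: append the best-scoring token each step."""
    encoding = encode(pixels)
    tokens = [BOS]
    for _ in range(max_length):
        scores = next_token_scores(encoding, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return " ".join(tokens[1:])  # drop the start marker
```

Calling generate_caption([0.9, 0.8, 1.0]) walks the table along the "bright" branch, while a dark input takes the other branch. Real captioning models often replace the greedy choice with beam search, which keeps several candidate captions alive at each step.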
Okay, or maybe I can make it 20, why not. So 20, and it looks fine. I'll do a quick pip install transformers, excuse me, and it'll take a little time. You can run this locally as well. I'll be using Google Colab for this kind of task, where I'm using a pre-trained model. We'll use this and see how it works. If it works fine, I'll take this pre-trained model and build an API on top of it, and later on we'll use this API through Postman, or maybe we can integrate it into an existing or a new app that we build in Streamlit or any other web or mobile application.

So: from transformers import VisionEncoderDecoderModel. We need ViT [Music], we need some kind of feature extractor, and we'll be using this feature extractor for the encoder model. Then we need the tokenizer, so we need AutoTokenizer. Perfect. We'll be using PyTorch as the backend framework, so import torch, and then we need Pillow, so, sorry, from PIL import Image. As the imports are now done successfully, what we are going to do is take the source from Hugging Face. We are going to use a pre-trained model, and the links will also be given in the description. We'll be using this ViT GPT-2 image captioning model; this is the encoder and decoder model I was talking about. Let me go back. You can get the source from here: nlpconnect, this is the source. You can also run inference on your sample images over here, you can try it out. So what I'm going to do is use the class that I have imported: VisionEncoderDecoderModel.from_pretrained. Then I'll just give it the Hugging Face source that we have. We do not need to give the entire Hugging Face URL; just pass the model name, excuse me. It will be downloading the model from
there; it will take a little time depending on your internet bandwidth. Next I'll also download the feature extractor. We need some kind of feature extractor, so we'll use the ViT feature extractor that we imported earlier: feature extractor dot from_pretrained. Again it's from_pretrained because we are going to use a pre-trained model. I'll pass this, excuse me, I'm just passing the same source, so let me copy it. Next you also need the AutoTokenizer, so I'll just paste it there, and maybe I can also define the tokenizer here. I'll say tokenizer equals AutoTokenizer.from_pretrained, and in this from_pretrained I'll pass the Hugging Face source that I copied from above. You can see that the model has been downloaded, the weights etc., and this should do. Let's see. It takes some time to download the feature extractor and tokenizer from Hugging Face, so let it run.

Now, we can use a GPU, but if you have a single image and you just want to do inference, it might take a few seconds to a few minutes depending on the processing power that you have. Let's see how we can do it. I'll define device, which you can find in the torch documentation as well: torch.device, and then I'm going to say cuda if torch.cuda.is_available(), so we are checking whether CUDA is available, else use the CPU. Then we just pass the model to the device: model.to, and pass the device. What we are doing in this step is trying to find out whether there is any GPU available. If there is a GPU available that can use CUDA, it will work as an accelerator; if not, we'll use the CPU as it is. So, model.to(device). Let me bring up the CPU task manager here to see how much it is using. Then we need to define a few constants, the kwargs and args, and we'll define them here. So if you are not, if
you are not using PyTorch in your day-to-day deep learning activities, you can read up on maximum length and number of beams, and how we can use those as kwargs. So what I'm going to do is set max length to some value, which is 10, and then number of beams equals 4. Then I'll use this gen kwargs variable and set it to a dictionary; we have to pass these as a dictionary. So I'm going to say max_length, max underscore length, mapped to max length, and, excuse me, then num_beams mapped to number of beams. Let's run this. So we have defined some of the parameters that we'll use when we write the function for the caption generator.

Now I'll define predict_step. We'll use the function that you can also find on Hugging Face, which takes the paths of the images, so you can have one or multiple images there. It takes your image paths, and we'll write the function body within it. We'll define an empty list that will contain all your images, and then we'll say for image_path in image_paths. Then we take the image, i_image, and open it: excuse me, Image.open, and we pass the image path here. Good. What next? We can check whether the image is RGB or not. So if i_image.mode is not equal to RGB, we convert it with i_image.convert. We are checking whether the image is not in RGB mode, and if so we convert it using this convert method. You can see it accepts this parameter, which is mode, so we'll say mode equals RGB. Then what we can do, we
can, within this for loop, append it: images.append(i_image). Let me see if this looks good. So what are we doing? For image_path in image_paths, i_image equals Image.open(image_path), then we convert it to RGB if it's not already, and then we use this image as an input. That's it. What next? We need the pixel values, the encoding that we are producing. If you remember, if I go back to this cast... sorry, the casting has stopped, let me sync it up once more. Let me see if I can pull this cast back. You can see it over here: we have this image, we need to get its encoding, and then we pass it to a decoder model, the language model or LLM, large language models like RoBERTa or GPT-2 for example. I have different devices connected to the main device that I'm using to write the code, so I have to suffer a lot today.

So we have images.append(i_image). Next we have to get the pixel values. Let's say pixel_values equals, and you have to pass the images to the feature extractor, the feature extractor variable that we defined above. We'll pass images equals images, the images list where we appended i_image in the line above. Then we have to return tensors, so we'll say return_tensors, and since we are using PyTorch we say pt. Then we need just the pixel values, which is pretty much self-explanatory. If you have been working with PyTorch it will be easy for you to understand. This is not a PyTorch series, so I'm not covering those things; maybe we can cover them in a separate PyTorch playlist. Then pixel_values equals pixel_values.to, and you pass this device. Let me see. So we appended the image and then we extracted the features: I'm passing images equals images, return_tensors equals pt, and taking pixel_values. Now what we can do is get the output IDs for the images that we
have. So output_ids equals: we have to use the model to generate based on the pixel values, so model.generate with pixel_values, and then we have to pass the parameters that we defined above. What was that? Yeah, here, gen kwargs. Then we have the output IDs. So what next? Pixel values, feature extractor, then output IDs; next are the predictions. So preds equals: we have to use the tokenizer, so tokenizer dot batch_decode. Let me just check what we have imported above: we imported AutoTokenizer and the feature extractor. We pass the output IDs, and we can skip the special tokens if there are any, so we'll say skip_special_tokens equals True. That's the prediction. Then we say preds equals pred.strip(), which is a string method, for pred in preds, the variable that holds the predictions. That's it, and then we just return preds. I think this should do.

Let me recap. We have written a function, predict_step, and in it we start with what I'll call image preprocessing: we load the images. We first have an empty list of images, then we check whether each image's mode is not RGB and convert those to RGB, and then we append each input image to the images list that we defined above; there can be one or more than one. Then we run the feature extractor on the images, and we return tensors because we are using torch here, as you can see from the pt. We get the pixel values, so we are getting the encodings, and then we pass them to the model to get the decoding done, and then we just return our predictions. Let's see. Perfect, let's run it. So what I will do, well, I have to see, I'll
upload a couple of images here. I'll upload these two images. You can also mount Google Drive in Colab, so you can import all of your images or data after mounting it; you can use those Drive folders or files. It works similarly on cloud notebooks as well: if you talk about SageMaker, you can connect SageMaker with AWS S3, and it's similar on GCP, where we have Google Cloud Storage, and the same with Azure. Similarly, you can mount Drive in this Colab notebook and work from there. I have uploaded the images here directly in the runtime, so whenever you close this session or the session ends, you will not be able to use these images or data anymore.

So I'll try this predict function, predict_step, and I'll pass the input as a list, as you can see from how we defined it. Let's pass one of the images first and see what it returns. It might take a little time, a few seconds. It took around 8 seconds for this image, you can see it over here, and it's giving you "a man riding a horse on top of a beach". If I open this image, you can see it: it's a man riding a horse, as the caption that we received says, a man riding a horse on top of the beach. It's working pretty well, isn't it? I really like it. Let's do one thing, let's try the other image as well. I have one more, which is a JPEG, so let me change the extension and see. You can try with multiple images, or go and try with different images. I'll share the code with you: I'll share the GitHub repository of this code in the description, and you'll be able to find it there. So what I'm going to do now, let's look at this image. I'll open this image: you can see a tennis player, kind of Rafael, playing some tennis. So you
see "a man holding a tennis racket on top of a tennis court". Amazing, right? So you saw how we created this model, sorry, used this model. We are not creating a model here. Maybe we can also create a model: we can take the data from Flickr8k or Flickr30k, and once we complete this API, maybe in the third or fourth video, we can train a similar model and compare the performance of this pre-trained model and the model that we train. In the model that you saw, what did we do? We have an input image, we use an encoder model to get the encodings, and then we pass them to an autoregressive or decoder model to get the caption. So you see how NLP and computer vision are fused together to achieve the results that we want.

An image caption generator is a very powerful application. I have been using image caption generators: Facebook, for example, suggests captions to you. A lot of organizations, basically in broadcasting, media, and entertainment, are using image caption generators built in; they suggest or recommend captions based on the image that you upload or are trying to post. What I'm going to do next is take this pre-trained model from Hugging Face (we can actually take it from this Colab notebook now) and use FastAPI to build an API, and that API can be consumed anywhere in an application. This we'll cover in our next video. Until then, you can take this code, or take it directly from Hugging Face, try some sample images with this model, and see how it performs on your images. Let me know your thoughts in the comment box. I hope you liked this video. If you like the content that I'm creating, please like, share, and
subscribe to the channel as well. Thank you so much.
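The steps narrated in the transcript (load the model, feature extractor, and tokenizer; pick a device; define the generation settings; write predict_step) can be collected into one sketch. This is a reconstruction from the narration, not a copy of the notebook. Assumptions: ViTImageProcessor is used as the current Transformers name for the ViT feature extractor the video imports; max_length is set to 10 because that is the value heard in the transcript; the helper names load_captioner, load_rgb_images, and caption_images are introduced here for readability; and the third-party imports are deferred into the functions, whereas a notebook would normally place them in the first cell after pip install transformers.

```python
# Model repository named in the video description.
MODEL_ID = "nlpconnect/vit-gpt2-image-captioning"

# Generation settings as stated in the transcript: max length 10, 4 beams.
GEN_KWARGS = {"max_length": 10, "num_beams": 4}


def load_captioner(model_id=MODEL_ID):
    """Download model, image processor, and tokenizer; move the model to a device."""
    import torch
    from transformers import (
        AutoTokenizer,
        ViTImageProcessor,
        VisionEncoderDecoderModel,
    )

    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    processor = ViTImageProcessor.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Use the GPU through CUDA when one is available, otherwise the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    return model, processor, tokenizer, device


def load_rgb_images(image_paths):
    """Open each image and convert it to RGB mode if it is not already."""
    from PIL import Image

    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        images.append(i_image)
    return images


def caption_images(images, model, processor, tokenizer, device, gen_kwargs=None):
    """Encode images to pixel values, generate token IDs, decode to captions."""
    gen_kwargs = GEN_KWARGS if gen_kwargs is None else gen_kwargs
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)
    output_ids = model.generate(pixel_values, **gen_kwargs)
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return [pred.strip() for pred in preds]


def predict_step(image_paths, model, processor, tokenizer, device):
    """Caption one or more image files, given as a list of paths."""
    images = load_rgb_images(image_paths)
    return caption_images(images, model, processor, tokenizer, device)


# Example call in a notebook ("horse.jpg" is a placeholder file name):
#   model, processor, tokenizer, device = load_captioner()
#   predict_step(["horse.jpg"], model, processor, tokenizer, device)
```

Splitting file loading from captioning is a design choice of this sketch: caption_images only needs objects that behave like the processor, model, and tokenizer, so it can be exercised without downloading the weights.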

Original Description

In this video, we used a pre-trained image caption generator model based on the power of transformers and the Python programming language. In a future video, with the help of the FastAPI framework, we will build an API that allows users to easily generate captions for any image, all powered by cutting-edge AI technology. Try it out for yourself and see the results!
GitHub Link: https://github.com/AIAnytime/Image-Caption-Generator-App
Hugging Face link (Credits): https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
#artificialintelligence #python #deeplearning #ai

This video shows how to run a pre-trained image caption generator model from Hugging Face in Google Colab, with an API built on FastAPI planned for a follow-up video. The model is used through the Transformers library in Python, and the video provides a starting point for building AI-powered image captioning applications.

Key Takeaways

Install required libraries and models
Load pre-trained image caption generator model
Use Google Colab to run and test the model
Plan to build an API using FastAPI
Explore Hugging Face models and transformers

💡 Using pre-trained models and libraries like Hugging Face can simplify the process of building AI-powered applications, such as image caption generators.
