Image to Text Prompt: Reverse Engineering AI Image Generation
Key Takeaways
This video demonstrates reverse engineering AI image generation using the Clip Interrogator library, built on OpenAI's CLIP and Salesforce's BLIP, for image-to-prompt generation.
Full Transcript
Hello everyone, welcome to the AI Anytime channel. In today's video we are going to work on a very interesting problem. You have already seen a lot of image generation models, generative AI models where you write a prompt and get an image as output. Now imagine you want to go the other way and generate a prompt from an image: you have an AI-generated image and you want to understand what kind of prompt someone used to generate it. Can we reverse engineer the entire process? Is there any way to reverse engineer Stable Diffusion, Midjourney, or DALL-E, for example? That is the problem we are going to tackle today.
We will use something called CLIP, so again OpenAI to the rescue. CLIP is an acronym for Contrastive Language-Image Pre-training, and there is a tool that has been built on top of this model. On my screen I have an image of the Mona Lisa that I generated with Stable Diffusion. It took multiple iterations, and the prompts came from ChatGPT: I asked ChatGPT to suggest a prompt and then used that prompt to generate this image. I'll use this as a reference image, and I can also use some of the other images in my directory.
First, let's have a look at CLIP, which, as I said, stands for Contrastive Language-Image Pre-training. It's an efficient model that learns visual concepts from natural language supervision. It's a very interesting fact that
CLIP was developed by combining research on zero-shot learning and natural language processing. If you're not familiar with zero-shot learning, it is a technique that allows a machine learning model to recognize new objects without having been trained on examples of those objects. CLIP has been trained for that purpose, so it generalizes better on unseen objects and is very robust.
I am currently on the OpenAI website, looking at the CLIP page. It says CLIP is a neural network that efficiently learns visual concepts from natural language supervision, and that it can be applied to any visual classification benchmark, similar to the zero-shot capabilities of GPT-2 and GPT-3. We are living in a world of zero-shot and few-shot learning, and this is why these models generalize better across a vast variety of data and data mediums.
If you scroll down, there is also a benchmark comparing ResNet-101 and CLIP ViT-L. What is ViT-L? ViT stands for Vision Transformer, and the L stands for large. The benchmark shows how CLIP compares on the ImageNet dataset and on other datasets; it reports accuracy on the ImageNet test set, and you can read the details there.
Further down they have a high-level architecture, which is fascinating to understand. They have separated it into three steps. First, it pre-trains an image encoder and a text encoder to predict which images were paired with which texts. Then they use this behavior to turn CLIP into a zero-shot
classifier, as you can see here. The paper is very interesting to read; they have done a lot of benchmarking on different datasets that you can look at. They also list the advantages of CLIP: it is very robust, and they have metrics showing that it generalizes better on unseen objects.
Multiple tools have since been built on top of CLIP, and one of them is called CLIP Interrogator. This is what we are going to use: we will pass our Mona Lisa image, or any other AI-generated image, to the CLIP Interrogator tool. I'll put the link to the research paper in the description so you can have a look at it.
I'll use Google Colab for this task. I will take an image as input and try to generate the corresponding prompt. Strictly speaking it is a detected prompt: a prompt detected, or generated, by the AI model, that is, by the CLIP Interrogator tool built on top of the CLIP model.
Now let's understand the use cases. First, this kind of solution helps in reproducing an image, whether it is the Stable Diffusion image we have or any other AI-generated image. Second, it helps as a response validator. We have been talking about prompt engineers; can we also talk about response validators? How are we going to validate the responses we are getting? This is one example. Third, it helps end users, companies, and organizations detect AI-generated images: once you have the prompt, you can use some kind of rule-based engine or a regular
expression, or anything similar, to estimate whether an image was generated by AI or not. This is what we are going to do.
I will use Google Colab here. I'm spinning up a notebook from Drive, and I will use the free GPU that Google provides in Colab. Let me quickly rename the notebook; I'll call it "reverse SD", for reverse Stable Diffusion. Next I'll change the runtime: I click on Runtime, then on "Change runtime type", select GPU, and click Save. It says it is connecting and initializing a Python 3 Google Compute Engine backend (Colab uses GCP in the backend), and now you can see it's connected. I can also mount my Drive. I don't know if it makes sense to mount it for this, but let me mount my Google Drive anyway.
The CLIP Interrogator tool that we are going to use was developed with the help of OpenAI's CLIP and something called BLIP, which is by Salesforce. Let me write down what we are going to do: for this reverse Stable Diffusion task we take the help of OpenAI's CLIP and BLIP, a framework developed by Salesforce. This is the underlying tech, and the tool we are going to use, as a Python library, is CLIP Interrogator.
Now my runtime is connected and I am in the notebook. First I have to install the library, so I'll run pip install clip-interrogator, and the version I need is 0.6.0. It first says "invalid requirement", and now you can see it's installing. If you want to run it on your local system instead of Google Colab, where PyTorch, Hugging Face, and a lot of other libraries are installed by default, you need Transformers, torch, and sentencepiece to work with the library version I am using, clip-interrogator 0.6.0.
Now that it has installed successfully, let's import a couple of libraries. The first one we need is Pillow, so I'll write from PIL import Image, with a capital I in Image. Then I need clip_interrogator: from clip_interrogator we import two things, Config and the Interrogator class. You can see the imports have succeeded, so these are the dependencies we need for image-to-prompt. We'll upload an image as input and generate a prompt from it, to see what kind of prompt might have been used to generate that input image.
Let me first upload the images. I'm going to upload a couple of them: the first is the Mona Lisa image, and let me also add this "HD sample" image. I'm uploading them into the current runtime. You can also load them from your Drive folder, since I have mounted it, but I just want to explore the tool. In Image.open I pass the image, so the first one is Mona Lisa.png. I think it's already RGB, but you can also convert it to RGB. Often the images we generate are not RGB, so we can also do that.
You can convert it to RGB, and if you are still getting an error, there is something else to look at: when I was trying to integrate this tool with Gradio and Streamlit I was getting an error, and in that case you can try something like an image-to-array conversion.
Let me first show the image. I'll just evaluate image and it displays the image I have loaded. A very beautiful generated image of the Mona Lisa by Stable Diffusion, right?
Now I'm going to create a variable called ci and assign it an instance of Interrogator, the class we imported. In it I have to use Config, and in Config I specify the CLIP model. The parameter is called clip_model_name, and the model is again from OpenAI. It's a Vision Transformer model, ViT-L-14 (capital V, small i, capital T, then L and 14), followed by openai so that it loads the OpenAI weights. Let me run this. It says "Loading caption model blip-large", so we are loading the large models. ViT-L means Vision Transformer large; the 14 may be a checkpoint version, I'm not sure about that number right now. [In CLIP's naming, the 14 is the patch size: the model works on 14x14-pixel patches.] So this is Vision Transformer large 14 from OpenAI. It is downloading all the model weights and checkpoints, and they will be cached in your runtime's local cache. Let it load; it will take a little time.
Meanwhile, if I go back to the CLIP page, you can see the comparison of the CLIP Vision Transformer large model with ResNet-101, which is already a very
large model. There is a benchmark showing the difference in performance on multiple test datasets. It's a very fundamental paper to read as well, and I'll share the link in the description.
You can see it's now loading the CLIP model, the ViT large one from OpenAI. Meanwhile let me write the next piece of code. You can see it has loaded successfully in 35 seconds, about half a minute. Now we just pass the image. I'll write result = ci.interrogate(...), so interrogate is the method I am going to use, and I pass in my input image, the Mona Lisa image that I stored in the image variable. Let me run this.
It will take anywhere from a few seconds up to a minute, depending on your memory and RAM. You should run this on a GPU machine; if you try to run it on a CPU machine you might get an out-of-memory error or something similar. It has now been running for almost 30 seconds. It performs inference with the model we loaded above, and then we can print the result to see what we are getting.
You can also keep a check on the RAM indicator here, which shows how much RAM has been used. I have very little RAM on this one, which makes sense because I'm using the free GPU runtime given by Google Colab. I think they also provide 15 GB of space, if I'm not wrong. It might take up to a minute or two, so let me come back
once this is done; I'll pause the video here.
You can see the cell has run successfully. Now let's print the result and see what kind of response we get. Are we getting a prompt? Here is the result: "a painting of a woman with a candle, rendered on Unreal 3D engine" (that's an engine, by the way), "painting of Mona Lisa". See how superb this is: you have an input image and you get a prompt out of it, and it is so real, so accurate. You have "a painting of Mona Lisa", then "Unity render", "invisible woman", "by Pinturicchio" (that's an artist style, by the way), "Nvidia promotional image", "aquiline features", "da Vinci" (again an artist), "depth blur". This is the prompt we are getting.
So the input was the Mona Lisa image that I generated using Stable Diffusion with the help of ChatGPT prompts, and when I pass it in, this is the response I get. It doesn't matter whether you upload a Stable Diffusion, a DALL-E, or a Midjourney image; just upload an AI-generated image, because this can also help you detect whether an image was generated by an AI or not. This is very helpful if you want to reverse engineer the process, or if you want to detect AI-generated images on the internet, in your application, or in some other kind of content. It also helps you reproduce: you take this prompt and reproduce similar images with it. There are a lot of tools around this; currently I am on PromptHero, which provides different examples and samples of prompts.
Now let me do one thing: let me change the image. I have
changed it to HD sample.jpg, the other image. By the way, this is a compute-heavy model that requires high computational power; you need a GPU to run it. I don't know if a quantized model is available. If people have quantized it to a different float precision or something, you can try that out and see if it runs faster on a CPU machine. I haven't tried it yet.
Now I will run this. I don't know why I did that; let me run it once more. I have loaded the image, and now I'm showing it by displaying the image variable, so you can see this is my image. Now I'll pass it in again. I don't need to load the model again, so I just do result = ci.interrogate(image), and you can see it's performing inference on that particular image. Let me keep an eye on my RAM. It should not exceed the limit, because otherwise my session will crash on Colab. That's the problem with Colab; there are workarounds to manage that, but I think this should work.
There are a lot of applications I can see for this. You can create a Gradio or Streamlit application; you can even create a Gradio application directly from Google Colab and share it with your friends and peers to evaluate the model and the kind of responses you get. The more you test, the better you will understand the model's robustness and, of course, its accuracy. It's not going to be 100 percent accurate, of course, but it works well.
Now let me print the result. It says "a close-up of a man with a beard and a jacket", and this is fantastic, right? There is an art style, "Greg Rutkowski", then "made in 1800s", "Discord profile picture". Wow, this is amazing. We have an artist named
Edmund Blair and Charlie Bowater, we have "nationalist"; this is so good. See this: "realistic masterpiece". Let me copy this and paste it somewhere. Maybe I'll open Sticky Notes; meanwhile let me open Notepad. I don't want to wait that long, so let me just paste it here. Now let me go back to my pictures. This is the HD image that I uploaded as input, and now we have this prompt for it.
So what have we done here? We have reverse engineered prompt-to-image, or text-to-image. Now we are doing image-to-prompt, or, to state it correctly, going from an image generated with the help of AI to a prompt. We have reverse engineered the entire process, and this is extremely powerful. It has a lot of use cases, as I discussed at the beginning of the video. You can build an application or an API that takes an image as input and generates a prompt. You can use it to detect AI-generated images. A lot of applications and startups are already working in this area: they have built software-as-a-service platforms where you upload an image and they tell you whether it was generated by Stable Diffusion, Midjourney, or DALL-E, and what kind of prompt was used to generate it. Such applications already exist in the industry and the community, and you can do the same. You can use this to reproduce images, or to detect AI-generated images, explicit content, or any other kind of content; you can do keyword matching or similarity search on the prompt to find out. It's extremely popular, and
it's up to you how you extend this further; I just wanted to show it. I'll share this notebook in the description, and it will be available on the AI Anytime GitHub repository. Maybe in a coming video I will create a Gradio or Streamlit application, deploy it, and also try to containerize it. It totally depends on you: you can create an API out of it, host it or deploy it on RapidAPI, and you can also make money out of it. But don't forget to give credit to the people and the community who built this fantastic tool called CLIP Interrogator. You can find it on PyPI or GitHub.
So this is what we did in today's video: we reverse engineered Stable Diffusion, doing image-to-prompt rather than prompt-to-image. This is very interesting. I wanted to work in this area, and I'm very happy with the responses we are getting and with the performance of the CLIP and BLIP models. CLIP by OpenAI and BLIP by Salesforce work extremely well when combined through this tool, CLIP Interrogator, which is available on PyPI and GitHub. Please give them credit and cite them as well; kudos to them. There are a lot of use cases here.
Please let me know what you do with this code base, or if you extend this tool further, and please share your thoughts and feedback with me. That's all for this video. I hope you are liking the content. If you like my content, please share the channel with your friends and peers, and if you haven't subscribed yet, do subscribe to the channel. That's all; see you in the next video. Thank you.
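The notebook steps described in the transcript can be sketched roughly as follows. This is a minimal sketch, not the author's actual notebook: it assumes that clip-interrogator 0.6.0 and Pillow are installed (pip install clip-interrogator==0.6.0) and that a GPU runtime is available. The helper name image_to_prompt and the file name in the closing comment are hypothetical.

```python
def image_to_prompt(image_path: str, clip_model_name: str = "ViT-L-14/openai") -> str:
    """Return the text prompt that CLIP Interrogator detects for an image."""
    # The heavy dependencies are imported inside the function so that the
    # definition itself loads even where they are not installed yet.
    from PIL import Image
    from clip_interrogator import Config, Interrogator

    # Convert to RGB, since generated images are not always stored as RGB.
    image = Image.open(image_path).convert("RGB")

    # Creating the Interrogator downloads and caches the BLIP caption model
    # and the CLIP weights on first use, which takes a while.
    ci = Interrogator(Config(clip_model_name=clip_model_name))

    # Inference can take from a few seconds up to a minute or more.
    return ci.interrogate(image)


# Example call with a placeholder file name (assumption):
#     print(image_to_prompt("mona_lisa.png"))
```

For several images it is cheaper to create the Interrogator once and call interrogate on each image in turn, which is what the video does when it interrogates the second image without reloading the model.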
Original Description
This video explores the fascinating world of reverse engineering the AI image generation process. We showcase several examples of image-to-prompt generation by leveraging the "Clip Interrogator" library developed on top of OpenAI's CLIP and Salesforce's BLIP. This kind of application helps in reproducing the image, detecting AI-generated images, etc.
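As a rough illustration of the detection use case, the "rule-based engine or regular expression" idea mentioned in the transcript might look like the sketch below. The keyword list and the function name find_generator_hints are hypothetical, with the keywords taken from terms in the two prompts shown in the video. Matching such terms is only a weak hint, not proof that an image is AI-generated.

```python
import re
from typing import List

# Hypothetical keyword list (assumption), taken from terms in the detected
# prompts shown in the video; extend or replace it for real use.
GENERATOR_HINTS = [
    "unreal",
    "unity render",
    "nvidia promotional image",
    "depth blur",
    "discord profile picture",
    "greg rutkowski",
]

# One case-insensitive pattern that matches any hint as a whole phrase.
_HINT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(hint) for hint in GENERATOR_HINTS) + r")\b",
    re.IGNORECASE,
)


def find_generator_hints(prompt: str) -> List[str]:
    """Return the distinct hint phrases found in a detected prompt, lowercased."""
    found: List[str] = []
    for match in _HINT_PATTERN.finditer(prompt):
        phrase = match.group(1).lower()
        if phrase not in found:
            found.append(phrase)
    return found


sample = "a painting of a woman with a candle, Unity render, depth blur"
print(find_generator_hints(sample))  # ['unity render', 'depth blur']
```

A similarity-based check, which the transcript also mentions, would replace the fixed keyword list with a comparison against known prompts; that variant is not shown here.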
Join me on this exciting journey of "Image to Text Prompt" and discover how to use cutting-edge AI techniques to create stunning content.
CLIP Link: https://openai.com/research/clip
Salesforce BLIP Link: https://github.com/salesforce/BLIP
#midjourney #ai #chatgpt