AI agent + Vision = Incredible

AI Jason · Intermediate ·🛠️ AI Tools & Apps ·2y ago

Skills: Multimodal LLMs90%Agent Foundations70%

Key Takeaways

This video teaches how to build a vision-powered AI agent using autogen, llava, and stable diffusion

Full Transcript

This video is sponsored by SceneXplain, the leading image-to-text platform. What would happen when autonomous AI agent got GPT-4 vision power? It would allow us to build front-end agent. It can continuously iterating and improve the web design, answer complex questions that are not possible to be answered today, and even power general-purpose robots where it can make plan and take actions based on the camera image it is taken. We saw a lot of people start experimenting with ChatGPT vision, but it wasn't clear what the boundaries are. What kind of image task that it is doing really well today and what are the ones that's not. And how is prompting a multimodal model like GPT-4V different from other large language model that we have been using that only takes text. But Microsoft released a research paper that answered exactly those questions, where they tested hundreds of different image tasks with GPT-4V to understand what it is actually good at and what it is not. And also introduced some new prompting tactics. I will break this down for you so that you can understand what it is good at, what it is not, and how you can improve. And in the end, I'm going to show you a case study about how can you build an autonomous AI agent with vision ability today using AutoGen, Stable Diffusion, and Lava model that can continuously self-improve AI generated image. So, let's get it. Firstly, let's talk about what's the real power of multimodal large language model. And if you don't know what multimodal is, here is a quick explanation. The large language model that we have been using today only takes text input. It will take large amount of text data into vector, so the next time when someone prompting new text, it will start predicting the next words based on the vector space. A multimodal model, on the other hand, takes not only just text data, but also image, audio, and even video. Behind the scenes, it will try to tokenize all the different type of data and create a joint embedding so that it will understand these three type of data even though they are different format, they are similar information. And this unlocks some pretty crazy capabilities. For example, if you give the model an image of a park as well as the audio of dog barking, it can return a image that is relevant to the scenario. And if you give it image of your fridge, it'll be able to identify what are the items are and also come up a menu for it. So, this definitely opens up lots of new use cases. And the one that GPT-4V unlocks is the text plus image. GPT-4V can handle loads of different image types. It can easily understand a photograph. They can also understand text within the image very well. Even though some of text are distorted and hard to see like ones in those examples. It can also understand formulas, table, diagram, and even floor plan. And this is particularly exciting to me cuz there are tons of data that can't be easily digitized and communicate to AI at the moment. As many of the companies' knowledge base are PDF files like diagram and charts. And the part that I found most surprising is you can actually pass GPT-4V a few pages of documents. Like in this example, they actually feed GPT-4V six pages of a research paper. And the GPT-4V actually summarize the whole research paper really well. I'm pretty curious to understand how does this work behind the scenes. Does GPT-4V actually extract all the text data across page and then to understand or it is doing something different? Cuz this might help us build better the retrieval system. And on the other hand, GPT-4V not only just extract data from image, it actually understands those images. For example, it can recognize in this image the person is Biden and on the right side, this is CEO of Nvidia presenting a new product from Nvidia. Very likely to be a new GPU. Same kind of recognition ability has been demonstrated across landmark, food, brands, and logos. Even though some of logos are presented in distorted way. On the other hand, it can do tasks like counting the number of objects in the image and even do the reasoning. If you ask this question, is the person bigger than the car from this image? It will know that it is not because of the distance and perspectives. So, the overall out-of-box performance is really impressive, but it also makes tons of mistakes. For example, when we ask GPT-4V to extract data from this ID, it get everything right except hair is brown. And if you present a chart like this and ask it which year has the highest average gas price, it will give you the wrong answer. And it has a lot of trouble understand the speedometer for some reason. And when it is doing the objects counting tasks, if items is cut off, it can give you wrong answers as well. And what's interesting here, the some common prompting tactics that we have been using like let's think step by step. But for some reason, this didn't seem to impact image tasks much. Even though it break down task, it still give the wrong answers. But here are some prompting techniques that it tried and worked really well. But before we talk about prompting techniques, I want to introduce you to Sing Explain. While you might not have access to GPT-4V yet, Sing Explain is a great alternative that you can use it today. They provide powerful multi-model model that can do a wider range of different image tasks. You can turn image to a 40 fashion outer story or doing question and answer on image. And they also fine-tune the performance for specific image task like extract JSON from image. This allows me to upload thousands of product image and tell the model about specific information that I want to extract like product color, brand name, material, texture, and keywords. So from this example, it is able to tell that this is an earphone, the color is black, and the brand name is Sing Plus. And they can work with not only image but also videos. For example, I can upload this video clip from Dunkirk and it is able to give me the full story even though this video has no script at all because the model actually break down the video into different frames and understand the image of each single frame. This is very unique ability that I haven't seen on any other platforms. You can try out their model today either through their web UI or use the API endpoint to build vision powered AI apps. On the outside, you can also access it through the ChatGPT plugin by pasting an image or video URL and have multimodal ability in ChatGPT. They provide free credits for you to use and if you decide to get more credits, you can use my code AIGC to get 15% off for the first month. Thanks again for saying thanks for sponsoring this video. Now, back to visual prompting techniques. The first is text instructions. Even though when we ask it to think step-by-step, it didn't really work well, but when you give it very specific text instruction, it does help improve the performance. For example, in this specific task, it gave the text instruction to firstly explain the image and structure this image in a 2x2 matrix and then it gave specific steps about how this image task should be done. Look at first column to understand pattern and then look at the second column to guess the missing image here. And this returned the right answer. So, this is a first prompting techniques that you can add in detailed text instructions that explain the image and structure so GPT-4V had more context and then tell it the steps to complete this task. But sometimes just text instruction is not enough. In this object counting example, it's still guessing wrong even though you give very specific step instructions for the counting task. And this is where they introduce the second prompting techniques, condition on good performance. At default, GPT-4V only has a goal to complete task, but it didn't have a goal to complete task well. So, you need to explicitly tell it the expectation. They add two parts to the prompting. Firstly, they it tell GPT-4V that you are an expert on counting things in image and in the end to be sure we have the right answer. With these two things, it set up the expectation better to guide real the behavior. However, just these two things are not enough. It still gets things wrong for complex task like reading charts or reading the speedometers. No matter how many details you add into the prompts, it's still guessing wrong. I got same experience. I tried to get it extract text from a receipt and no matter how many details I add into the prompt, it just guessing wrong. And this is where the third and probably most powerful prompting techniques come in, which is few-shot prompts. We have been doing this with other large language model, where we would give a few examples of how this task should be done, so that the large language model can follow. And same thing apply for multimodal model as well. For speed meter reading tasks, they tried to give you examples. This is the first image and this answer, and now let's try to extract data from the second image. And what they found is with just one shot, it's still guessing wrong. But when it provides GPT-4V with two examples, it starts showing much better performance and guessing things right. And same thing happened for the chart reading example. When they just present one example, it's still guessing things wrong. But when you present two examples, the performance just start increasing dramatically. I found this particularly exciting, because this will allow us to fine-tune the GPT-4V with very low cost for specific image task. For example, any manufacturer can build your own defect detect system with some training data. And same thing for the medical diagnosis as well. So, you can train GPT-4V to do very specific image task, no matter how niche it is. And fourth prompting techniques is what they call visual referring prompting. This demonstrate GPT-4V's ability to understand visual annotation. You can have those arrow and circles, and GPT-4V will be able to understand which object you were talking about. The green pointer here actually point to both the floor, the desk, as well as the bottle. But it is able to understand the main thing here is actually the bottle. And this is also super exciting, because they were unlocking new type of interaction that wasn't possible before. In future, people can just simply circle or pointing to something, and GPT-4V would be able to understand. A very simple use case is this could be used for customer support, where customer can just simply circle on a error that he don't really understand, AI assistant would be able to help. So, those are few prompting techniques that you can use to improve the GPT-4V performance. But GPT-4V's power don't just stop there. It shows exceptional ability to take multiple image input, and it is able to understand the relationship between images. For example, if you presented two image, one is the menu with a price tag, and other is the image of the table with food, it is able to understand and calculate how much money you will need to pay for the things you ordered. And this is another example that I found particularly interesting. They presented three screenshots from a video clip of Big Ben, and asked if the task is opening the door, what would be the order of the image? And it would tell you it is A B C. But if the task is closing the door, it is able to tell you the order is just the other way around. And this is another example where it presented GPT-4V three separate image, and asked GPT-4V to understand the relationship between these four and try to assemble it. It is able to do it properly like this. And to be honest, this is a task that even me as a human would take a while to understand. And you can also send image of different people and tell you the name of each person. Then presented a new photo, and it would be able to understand that this photo just includes three of the person that has been given, and what's their names as well. And also able to transcript a video that don't have any script, because it can understand image as well as interpret what the person is doing based on a sequence of image. So, this is GPT-4V with all those amazing capabilities. I think it unlocks a few very exciting use case. One is we can finally build real knowledge base for certain industry like architecture, engineer, manufacturing. And on the because it's ability to understand the image and video data, I think this will enable search across multiple different types of data. For example, as a brand, I can simply search, "What are the videos my brand logo has been presented?" And on the other hand, with things like future prompt, you can totally build defect detecting system, as well as the medical diagnosis as well. But the part I'm most excited about is the agent ability. For example, we can actually get GPT-4V to critique the image generated from stable diffusion and then give feedback about how to improve. This can create agent that continuously improve the image generation result. And on the other side, they also showcase a very interesting example for some kind of desktop automation. They give GPT-4V a task to go find a detail recipient of map of those and GPT-4V will be able to complete a series task. Open Chrome first, search in Google, get the first results, click on a jump to receive button and then try to print out this receipt and get the final receipt. And same thing could be huge for robot as well. They simulate a robot task where it give a robot a camera image in the living room and ask to go to the kitchen and get something from the fridge. It will be able to tell that I need to turn right to move towards hallway first and when presented a new image, it will again find the path until it find the fridge. If it's real robot, then it can trigger action to open the fridge as well. So, I think this really unlock tons of use case for agents and I want to show you a quick example of how can we build a agent with vision ability today. Use image generation example. So, I'll create agent system that it will continuously improve the results of stable diffusion image generation by having one text-to-image engineer and one AI image analyzer with AutoGen, stable diffusion and Lava. Because we don't have access to GPT-4V API endpoint yet, I'm going to use Lava as an alternative. And if you don't know what Lava is, it is not a multimodal model that is based on Lava 2. It is not as good as GPT-4V but good enough for us to build this proof of concept. I actually made a videos for both AutoGen and Lava model. So, you can check them out if you want to learn more. And in terms of Lava and stable diffusion model, I will be using the hosted version on Replicate. So, Replicate is a platform that allow people to host their own AI models. Their pricing is a lot higher than other hosting platform like Hugging Face or RunPod but they do provide free credits. So, it's very easy to getting started if you want to build a proof of concept or your volume is low. But if your volume became really big, I definitely recommend to deploy your own model on RunPod. So, let's get started. I will use Visual Studio Code to implement this agent. So, let's create a new folder and I will open the terminal to install both AutoGen and Replicate if you haven't yet. So, we will do pip install pyautogen and replicate and click enter. And once you finish, also making sure you go to Replicate under Lava certain B models API tab and copy this code to import your Replicate API into your terminal here. And once you finish, let's create a one file called OAI config list. So, this is where you will import your OpenAI API key. It will look something like this with array and inside will be JSON model and API key here. And next, let's create a Python file called app.py. And we will firstly import a list of library that we're going to use. And then, import your OpenAI API key here to config list and create large language model config. And if you're not familiar with AutoGen, at high level, it is framework that allow you to create a multi-agent system. So, in our case, I will need to create a group chat of two different agents. One should be doing image generation and have access to stable diffusion. And another should be doing the image analyze and give feedback about how to improve the prompt with access to Lava model. And the way we will implement this is I will create a function to use Lava model to review the image and give feedback about how to improve the prompt. And second is I will also have another function to use stable diffusion model to generate image. And then after that, we will create our two different agents and create a group chat to start the conversation. So, firstly, use Lava model. So, the function is pretty straightforward because we simply use the Replicate API endpoint. So, this function will have two inputs. One is the file path of the image that it need to be analyzed, as well as the original image prompt. So, I will do replicate.run with a URL to this Lava model and two inputs. One is the image file, as well as the prompt. What is happening in the image from scale one to 10, describe how similar the image is to the original text prompt, and then we'll return the results. And we can quickly test this. So, I will create an image folder, and inside image folder, I will put in example image. And inside app.py, I'll call this image review function, pass on image file path, as well as prompt a parrot driving car, and try to print the results. So, I can save this, open terminal, and then run python app.py. Okay, cool. So, we get this results. Uh the large language model is giving a rating three out of 10 in terms of similarity. Because even though we have parrot, it is not a really driving a car. So, this is pretty good. We can use this as feedback to the prompt engineer to generate a new image. So, next thing is I will delete this two and create new function for text to image generation with the input prompt. And again, I will use replicate with their stable diffusion model, with input is prompt. And once we get the results, I will try to get the image URL first, and then try to download this image to my local computer, and give it file name with a current date, and also a shortened prompt. Because sometimes prompt can be really long, so I want to making sure it maximum 50 characters. And save this image under this image folder. So, I will do request.get with image URL. So, this should return the image file. And once we get the results, I will save this file to my local machine. And I'll try again as well to calling this function text to image generation a rabbit in a hat, and do python app.py. Okay, great. So, it returned this results of image URL, and also a new image in the folder, a rabbit in a hat. This is actually a great example that this rabbit is not in a hat, but it's wearing a hat. So, now we have the core functionality. The next, we simply need to stitch them together with the agent. So, I will create a new large language model config for assistance, because the assistant will need to have access to those function we defined with open AI function calling. I will create a JSON file with function, and inside array, I will have this two function defined. One is text to image generation, use the latest AI model to generate image based on prompt, return the file path of the image generated. And the input will be prompt, description will be a great text to image prompt that describe image. And same thing for image review as well. So, it will review, critique the AI generated image based on original prompt, and decide how can image and prompt to be improved. Inside here, I will have two inputs. One is the prompt itself, as well as image path. And in the end, I will do the config list and request time out 120. And after that, I will create a two agent. One is a text to image prompt expert. So, you are a text to image AI model expert. You will use text to image generation function to generate image with prompt provided, and also improve the prompt based on the feedback provided until the similarity is 10 out of 10. With large language model config equal to the new config that we created and function map. And same thing for the critique assistant. So, name is image critique. You are a AI image critique. You will be using image review function to review the image generated by text to image prompt expert. Again, it's original prompt. And provide feedback on how to improve the prompt. Give the large language model config assistance and function map. And after that, I will create a user proxy agent. And if you don't know what a user proxy agent is, that basically represents the user, which is you, in this group chat. So, that when you have feedback or when agent needs help, you have a way to provide feedback. And after that, I will create a group chat, which is like chat room for all those agent that you just created. And I will set max round equal to 50. So, it can run this iterative loop for a couple times. And then, I will create a group chat manager with this group chat and I will trigger message to the manager, "Generate a photorealistic image of llama driving a car." And I will do python.py. Okay. So, firstly, it send out this message, and the prompt engineer try to use text to image generation with this prompt. Okay. And once it's finished, you can see this is the first image it generated. The And the image critique agent start triggering the image review with both image path and this original prompt. And then it returns a response that it would rate similarity seven out of 10. Because while the image does have a llama in the car, it is not photorealistic and the llama is not really driving the car. So, the prompt engineer tried again with the new prompt and generated new image like this. Well, the llama is actually inside the car. And again, the image creator start reviewing. But for some reason, he thinks this image represent 10 out of 10 for llama driving a car. Uh but what's good about AutoGen is it will actually get back to the user proxy, which is me, and ask me for providing the feedback. So, here I tell the agents the llama is not really driving. It sure has its palm on the steering wheel. So, it go back to the prompt engineer and it generate a new image of this one. So, I would call this pretty good. Uh it has exactly the llama inside the car in a position and uh it has its palm on the steering wheel. If I want, I can actually continue getting going and iterating the image. So, this is one quick example of how an agent with vision capability. And there are a lot more different types of agents you can actually start building now. I'm also interested to do a version of agent that can do more sophisticated tasks like browser automation. So, please comment below about what type of AI agents that you want to see me building. I'll continue sharing interesting AI projects that I'm building. So, please consider giving me a subscribe if you're interested. Thank you and I'll see you next time.

Original Description

A step by step tutorial of how to build vision powered AI agent via autogen + llava + stable diffusion AND Break down of 160-page analysis of GPT4V capabilities 🤘 Get 15% off on sceneXplain via my code AIJASON : https://go.jina.ai/scenexplainjason 🔗 Links - Follow me on twitter: https://twitter.com/jasonzhou1993 - Join my AI email list: https://www.ai-jason.com/ - My discord: https://discord.gg/eZXprSaCDE - sceneXplain: https://go.jina.ai/scenexplainjason - Vision-agent Github: https://github.com/JayZeeDesign/vision-agent-with-llava ⏱️ Timestamps 0:00 Intro 1:15 What is multi-modal model 2:12 GPT4V ability break down 4:34 sceneXplain 6:00 Visual prompt techniques 10:53 Use cases 13:00 Build vision agent #1 - Setup 14:20 Build vision agent #2 - Use Llava model 15:58 Build vision agent #3 - Use Stable diffusion 16:52 Build vision agent #4 - Set agent system via autogen 18:53 Build vision agent #5 - Demo 👋🏻 About Me My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com #gpt4 #autogen #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #chatgpt #largelanguagemodels #largelanguagemodel #bestaiagent #chatgpt #agentgpt #agent #babyagi #llava #stablediffusion

Watch on YouTube ↗ (saves to browser)

Sign in to unlock AI tutor explanation · ⚡30

Playlist

Uploads from AI Jason · AI Jason · 23 of 60

← Previous Next →

Build Your Own Auto-GPT Apps without coding Step by Step (Dust.tt Tutorial)

Build Your Own Auto-GPT Apps without coding Step by Step (Dust.tt Tutorial)

AutoGPT tutorial: Build your personal assistant WITHOUT code (Via Relevance AI)

AutoGPT tutorial: Build your personal assistant WITHOUT code (Via Relevance AI)

Create your own AI girlfriend that talks ❤️

Create your own AI girlfriend that talks ❤️

How to build with Langchain 10x easier | ⛓️ LangFlow & Flowise

How to build with Langchain 10x easier | ⛓️ LangFlow & Flowise

I build an autonomous researcher via GPT | Langchain ⛓️ Tutorial

I build an autonomous researcher via GPT | Langchain ⛓️ Tutorial

Smol AI tutorial in 5 mins | Build ENTIRE codebase with a single prompt

Smol AI tutorial in 5 mins | Build ENTIRE codebase with a single prompt

Hugging Face + Langchain in 5 mins | Access 200k+ FREE AI models for your AI apps

Hugging Face + Langchain in 5 mins | Access 200k+ FREE AI models for your AI apps

How to let GPT control anything & 10x powerful | 8 mins tutorial about GPT funtion calling

How to let GPT control anything & 10x powerful | 8 mins tutorial about GPT funtion calling

Extract data & automate EVERYTHING | 10x GPT function calling power

Extract data & automate EVERYTHING | 10x GPT function calling power

Finally, an AI agent that actually works

Finally, an AI agent that actually works

"okay, but I want GPT to perform 10x for my specific use case" - Here is how

"okay, but I want GPT to perform 10x for my specific use case" - Here is how

"Wait..this AI Agent does research for you 24hrs without hallucination?!" - Here is how

"Wait..this AI Agent does research for you 24hrs without hallucination?!" - Here is how

"How to give GPT my business knowledge?" - Knowledge embedding 101

"How to give GPT my business knowledge?" - Knowledge embedding 101

“Automation 2.0 coming…No more boring data entry job”

“Automation 2.0 coming…No more boring data entry job”

"How to 10x chatbot UX? 🤖 🖼️ " - Add Image Responses to GPT knowledge retrieval apps

"How to 10x chatbot UX? 🤖 🖼️ " - Add Image Responses to GPT knowledge retrieval apps

“LLAMA2 supercharged with vision & hearing?!” | Multimodal 101 tutorial

“LLAMA2 supercharged with vision & hearing?!” | Multimodal 101 tutorial

"Next Level Prompts?" - 10 mins into advanced prompting

"Next Level Prompts?" - 10 mins into advanced prompting

Build AI agent workforce - Multi agent framework with MetaGPT & chatDev

Build AI agent workforce - Multi agent framework with MetaGPT & chatDev

How to scale your AI automation pipeline

How to scale your AI automation pipeline

AI agent manages community 24/7 - Build Agent workforce ep#1

AI agent manages community 24/7 - Build Agent workforce ep#1

Autogen - Microsoft's best AI Agent framework that is controllable?

Autogen - Microsoft's best AI Agent framework that is controllable?

StreamingLLM - Extend Llama2 to 4 million token & 22x faster inference?

StreamingLLM - Extend Llama2 to 4 million token & 22x faster inference?

AI agent + Vision = Incredible

AI agent + Vision = Incredible

After 7 days letting AI agents control my email inbox... 📮

After 7 days letting AI agents control my email inbox... 📮

How to use New OpenAI DevDay features - GPT4V x TTS demo tutorial

How to use New OpenAI DevDay features - GPT4V x TTS demo tutorial

What is Q* | Reinforcement learning 101 & Hypothesis

What is Q* | Reinforcement learning 101 & Hypothesis

"Research agent 3.0 - Build a group of AI researchers" - Here is how

"Research agent 3.0 - Build a group of AI researchers" - Here is how

GPT4V + Puppeteer = AI agent browse web like human? 🤖

GPT4V + Puppeteer = AI agent browse web like human? 🤖

Real Gemini demo? Rebuild with GPT4V + Whisper + TTS

Real Gemini demo? Rebuild with GPT4V + Whisper + TTS

AI Robot's ChatGPT moment at 2024?

AI Robot's ChatGPT moment at 2024?

GPT5 unlocks LLM System 2 Thinking?

GPT5 unlocks LLM System 2 Thinking?

The REAL cost of LLM (And How to reduce 78%+ of Cost)

The REAL cost of LLM (And How to reduce 78%+ of Cost)

OpenAI's Agent 2.0: Excited or Scared?

OpenAI's Agent 2.0: Excited or Scared?

Real time AI Conversation Co-pilot on your phone, Crazy or Creepy?

Real time AI Conversation Co-pilot on your phone, Crazy or Creepy?

INSANELY Fast AI Cold Call Agent- built w/ Groq

INSANELY Fast AI Cold Call Agent- built w/ Groq

AI Employees Outperform Human Employees?! Build a real Sales Agent

AI Employees Outperform Human Employees?! Build a real Sales Agent

Future of E-commerce?! Virtual clothing try-on agent

Future of E-commerce?! Virtual clothing try-on agent

Unlock AI Agent real power?! Long term memory & Self improving

Unlock AI Agent real power?! Long term memory & Self improving

"I want Llama3 to perform 10x with my private knowledge" - Local Agentic RAG w/ llama3

"I want Llama3 to perform 10x with my private knowledge" - Local Agentic RAG w/ llama3

“Wait, this Agent can Scrape ANYTHING?!” - Build universal web scraping agent

“Wait, this Agent can Scrape ANYTHING?!” - Build universal web scraping agent

"Make Agent 10x cheaper, faster & better?" - LLM System Evaluation 101

"Make Agent 10x cheaper, faster & better?" - LLM System Evaluation 101

Claude 3.5 struggle too?! The $Million dollar challenge

Claude 3.5 struggle too?! The $Million dollar challenge

Make your agents 10x more reliable? Flow engineer 101

Make your agents 10x more reliable? Flow engineer 101

"I want Llama3.1 to perform 10x with my private knowledge" - Self learning Local Llama3.1 405B

"I want Llama3.1 to perform 10x with my private knowledge" - Self learning Local Llama3.1 405B

AI process thousands of videos?! - SAM2 deep dive 101

AI process thousands of videos?! - SAM2 deep dive 101

"Wait, I'm using OpenAI Structured Output wrong ?!" - Advanced Structured Output tutorial

"Wait, I'm using OpenAI Structured Output wrong ?!" - Advanced Structured Output tutorial

How to use Cursor AI build & deploy production app in 20 mins

How to use Cursor AI build & deploy production app in 20 mins

Best Cursor Workflow that no one talks about...

Best Cursor Workflow that no one talks about...

This is how I scrape 99% websites via LLM

This is how I scrape 99% websites via LLM

Better than Cursor? Future Agentic Coding available today

Better than Cursor? Future Agentic Coding available today

EASIEST Way to Train LLM Train w/ unsloth (2x faster with 70% less GPU memory required)

EASIEST Way to Train LLM Train w/ unsloth (2x faster with 70% less GPU memory required)

1000x Cursor workflow for building apps

1000x Cursor workflow for building apps

Easiest way to build fancy UI with Cursor/Windsurf/Bolt/Lovable

Easiest way to build fancy UI with Cursor/Windsurf/Bolt/Lovable

From $0 to $4m with just 2 people (ComfyUI Crash-course for E-commerce)

From $0 to $4m with just 2 people (ComfyUI Crash-course for E-commerce)

Deepseek R1 - The Era of Reasoning models

Deepseek R1 - The Era of Reasoning models

Yep, o3-mini is WORTH the money - Build your own reasoning agent

Yep, o3-mini is WORTH the money - Build your own reasoning agent

The ONLY way to run your own Deepseek on mobile...

The ONLY way to run your own Deepseek on mobile...

Those MCP totally 10x my Cursor workflow…

Those MCP totally 10x my Cursor workflow…

MCP = Next Big Opportunity? EASIST way to build your own MCP business

MCP = Next Big Opportunity? EASIST way to build your own MCP business

Gemini 2.0 blew me away - The future of Multimodal Model

Gemini 2.0 blew me away - The future of Multimodal Model

More on: Multimodal LLMs

View skill →

Google Veo 3 Tutorial: How to create AI Videos in Flow, Gemini or Google Vids?

Google Veo 3 Tutorial: How to create AI Videos in Flow, Gemini or Google Vids?

AI Tool Journey

NVIDIA Clara Guardian Virtual Patient Assistant

NVIDIA Clara Guardian Virtual Patient Assistant

NVIDIA Developer

Building Multimodal Search and RAG

Building Multimodal Search and RAG

Midjourney Trick: Consistent Character in Different Images

Midjourney Trick: Consistent Character in Different Images

Ollama Multimodal: EASILY setup Llava locally & Integrate API

Ollama Multimodal: EASILY setup Llava locally & Integrate API

The ONLY Real Time Speech AI that can run locally!!!

The ONLY Real Time Speech AI that can run locally!!!

Related Reads

Three ranking currencies and zero overlap: what 2025 Juejin AI roundups actually disagree about

Discover the three incompatible ranking currencies in Juejin's 2025 AI tool roundups and their implications for AI tool evaluation

How to Use Poe for Case Studies in 2026

Use Poe to access multiple AI models for efficient case study content creation

10 Ways to Make Money Using AI Tools in 2026

Learn how to leverage AI tools to generate income in 2026 and explore new opportunities for financial growth

I got tired of switching AI SDKs every time I wanted to try a new model

Simplify AI model integration by building a unified API, reducing switching costs between different AI SDKs

Dev.to · zhongqiyue

Chapters (11)

Intro

1:15 What is multi-modal model

2:12 GPT4V ability break down

4:34 sceneXplain

6:00 Visual prompt techniques

10:53 Use cases

13:00 Build vision agent #1 - Setup

14:20 Build vision agent #2 - Use Llava model

15:58 Build vision agent #3 - Use Stable diffusion

16:52 Build vision agent #4 - Set agent system via autogen

18:53 Build vision agent #5 - Demo

how i use a.i. to create viral UGC influencer facebook ads.