“LLAMA2 supercharged with vision & hearing?!” | Multimodal 101 tutorial
Key Takeaways
The video explores multimodal language models such as LLaVA, which is built on LLaMA 2 and aims to give users GPT-4-level multimodal abilities and unlock use cases like chat with images. It also references GPT-4 and Google PaLM 2. The tutorial covers the basics of multimodal AI, generative AI, and large language models, including joint embeddings and shared representations.
Full Transcript
Recently, large language models like OpenAI's GPT-4 and Google's PaLM 2 have shown incredible results integrating visual inputs and text to perform multimodal tasks. Yes, you heard me right: multimodal. This is kind of the next frontier of generative AI.
What that means is, unlike a large language model, which takes text input and turns it into vector embeddings so that it can understand the relationships between different words and use them to predict the next word in a sentence, multimodal models can take more than text inputs: image, video, audio, or any type of data, really. Behind the scenes, the model tokenizes the different types of data and creates a joint embedding, so that it has a shared representation space that captures information from text, image, video, and audio. Those shared representations enable it to solve problems and reason across different types of data.
For example, you can take a photo of your fridge and ask the model what kind of meals you can cook with all those leftovers. It will be able to understand what kind of food you actually have in the image and generate a recipe based on that information. You can also do some really advanced generation: if you give it an image of grass and also an audio clip, it will be able to generate or find an image with elements of both the audio and the grass. During the GPT-4 demo, OpenAI also showcased an ability where it can turn a wireframe sketch into a functional HTML website, because it can understand the image and extract the core information to complete various tasks.
So far, major large language models like GPT-4 haven't released any multimodal features yet, so most of us haven't had a chance to experience the power. But there's one multimodal model released recently called LLaVA, which stands for Large Language and Vision Assistant. It has the ability to run multimodal tasks across both image and text, it is integrated with LLaMA 2, and it is available for use right now. I tried it and it's very promising; it definitely gives us a taste of what the future looks like.
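The joint embedding idea described above can be illustrated with a toy sketch. This is not how LLaVA or any real model is implemented: the vectors below are hand-written, hypothetical values, and the file names and the `nearest` helper are made up for illustration. Real models learn an encoder per modality that maps inputs into the shared space. The sketch only shows why a shared space is useful: once text, images, and audio live in the same vector space, one similarity measure can compare items across modalities.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written embeddings in a 3-dimensional shared space.
# In a real system each vector would come from a learned encoder.
SHARED_SPACE = {
    ("text", "a dog lying on grass"): [0.9, 0.8, 0.1],
    ("image", "photo_dog_on_grass.jpg"): [0.8, 0.9, 0.0],
    ("image", "photo_city_street.jpg"): [0.1, 0.0, 0.9],
    ("audio", "dog_barking.wav"): [0.9, 0.2, 0.1],
}


def nearest(query_key, modality):
    """Return (key, similarity) of the closest item of the given modality."""
    query = SHARED_SPACE[query_key]
    candidates = [
        (key, cosine(query, vec))
        for key, vec in SHARED_SPACE.items()
        if key[0] == modality and key != query_key
    ]
    return max(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    # A text query and an audio query both retrieve an image,
    # using the same similarity function.
    print(nearest(("text", "a dog lying on grass"), "image"))
    print(nearest(("audio", "dog_barking.wav"), "image"))
```

With these made-up vectors, both the text query and the audio query land closest to the dog photo, which is the kind of cross-modal lookup a shared representation space makes possible.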
So today I will give you a demo of how you can try it out, as well as dive into a few real-world use cases I think could be very interesting. You can go to GitHub and search for LLaVA (L-L-a-V-A), their public page, where you can install it and run it on your local machine. But there's also a demo link, which will take you to a page that you can use right away.
For example, I can put this image in and then ask the question, "What is in the photo and what is the weather?" Click submit, and it returns that the photo features a golden retriever dog lying on grass, and the weather is sunny because there is bright sunlight shining on the dog and the green grass. So you can tell it is doing more than just object detection in the photo; it actually tries to understand the photo and reason about it.
On the other side, I can put another photo in and then ask it to describe the photo to me. It says the photo shows a man sitting in a chair wearing headphones and smoking a cigar. I can even ask a follow-up question like "Who is the man in the photo?" and it says it is Elon Musk. So again, this is not simple object detection; it actually tries to understand the photo and has probably figured out the connection between this photo and other text data around it.
To push this boundary a little bit more, I will upload a pretty complex image, where you actually need to read the image and understand what's going on. So I'm going to ask it, "Please generate a story based on this image." All right, so it generates a story. It is able to understand that it is a four-panel dramatic scene, and the story is that a woman and a baby were caught in a dangerous situation, possibly a sudden flood or strong current in the river, which is correct, and the man jumped into the water and saved them, which is also correct. So this is pretty impressive. I think it missed the last part about the man getting a medal, but I would say it got 80% right. What's more impressive is that it is able to understand the facial expressions. So overall I'm very impressed with the performance.
And now let's dive into a few real-world use cases that I think could be very interesting. One use case I really want to try out is product development. I was so blown away when OpenAI did a demo that turned a sketch into a real website, and if this capability is possible, I think there will be a lot of interesting case studies about how designers or product managers can use it during early ideation. So I drew a similar type of sketch of a joke website and added the image in, and I'll use the same prompt that OpenAI used: "Write brief HTML to turn this mockup into a website." Let me try this. All right, so it does generate HTML. Let me copy it into a Visual Studio file. So this is the joke website it created. I would say it only gets about 40% of the requirements here. It's not as good as GPT-4, but it does really understand the structure and tries to recreate it, and maybe my sketch is not very good.
A more realistic use case, I think, would be giving it an image of a mockup app design and then letting it break down the requirements for me. So I'll drag and drop this mockup of the Uber app and then give it the prompt, "You are a senior product manager. Please turn this mockup into a detailed product requirement doc." OK, so it is able to identify that it is a car sharing app without me mentioning anything; that's a really good start. It is able to break down the interface and navigation: the user needs to be able to select cars and also make a reservation, and unlocking. OK, that's a bit weird; it has probably added a bit of extra stuff. So it is able to break down a product requirement doc from this. Think about whether we can use this to create requirements and give them to another type of large language model that is really good at programming, like smol AI or GPT Engineer; then this could be a really interesting combination.
The next use case is content curation. Social media platforms spend a lot of resources on curating content, so I want to test out whether this is good at curating and classifying content. For example, I can put in this image, which seems pretty violent, and give it the prompt, "Please give this image a violence score out of 10." It returns, "I will give this image a violence score of 8 out of 10," because this woman is holding it down and is covered in blood. This is probably pretty aligned with how I would rate it as well. Next I will change to another photo, which should be pretty non-violent. All right, so it says, "I will give this image a violence score of 1 out of 10." Let's try something more controversial. This image should be funny, but there are also elements like fire, so let's see whether it can tell. All right, so we got the result: it is a violence score of 4 out of 10. There is fire, which is dangerous, but the child's mouth suggests that she is not in immediate danger and might not be fully aware of the severity of the situation. This is a pretty good rating. So the initial testing results for content curation are really promising, and with prompting tactics like few-shot prompts, it can probably do a really, really good job at content curation.
Another use case along the same lines as image classification is medical and health diagnosis. For example, I can give it an image and ask, "What is wrong with my foot? What should I do?" It is able to give me a basic diagnosis that it is a fungal infection. I don't know if that's correct or not, so please let me know if it is wrong. Then it is able to give some suggestions about how you can fix the issue. I can try another: I can give it an image of a plant and ask, "Is there anything wrong with my plant? If so, how can I fix it?" and it is able to give me a few suggestions. With enough training data, I think this could be a really promising use case in medical and biotech.
While making this video, I started thinking: is CAPTCHA going to be useless? Because those image-text models can probably crack most CAPTCHA use cases easily. So let me grab an image like this and then ask it to extract the text displayed in the image. OK, it kind of gets the things in the middle, but it's definitely not going to pass. Let me try another one. All right, again it is wrong. So I guess the good news is that, out of the box, CAPTCHA is still going to work against this model. But the thing is, the different research papers have shown that with few-shot prompts you can actually tune the performance of the model for specific tasks, which means that with proper fine-tuning and few-shot prompts, it is probably something that can be cracked easily. And I already found there are models like Microsoft's TrOCR that do an extremely good job at reading CAPTCHA letters. I can even try a more complex example like this, and it works perfectly; same for this one as well. So I do think that CAPTCHA verification methods probably won't be effective very soon, and there will be a new type of verification process that we need to figure out.
The last use case I want to share, which is really inspiring, is from Google. In Google's PaLM-E research paper, they integrated a multimodal model with a robot. They give the robot a prompt like "Bring me the rice chips from the drawer," and the robot is able to generate a plan like a normal agent, based on both visual and text inputs, and then it starts executing the tasks. As the robot moves and gets new visual inputs, it also updates the plan. This allows the robot to complete some very complex tasks because of the combined visual and text inputs, and it really showcases what is possible with multimodal models.
So that's large multimodal models. As I mentioned, models like LLaVA can already be tried out today. You can read more details, go to the demo website to use it right away, or go to their GitHub link and try to install it on your local machine as well. I had a lot of fun playing with LLaVA, so I definitely encourage you to go and try it out, and please comment below about interesting use cases you start exploring. If you like this video, please consider giving me a subscribe, and I'll see you next time.
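The few-shot prompting tactic mentioned in the content curation use case can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the prompt format LLaVA expects: the example descriptions, the prompt wording, and the helper names `build_prompt` and `parse_score` are all hypothetical, and the images are stood in for by short text descriptions so that the sketch stays self-contained. It shows two steps: packing a few labelled examples into the prompt, and pulling the numeric score back out of a free-text reply like the ones shown in the video.

```python
import re

# Hypothetical labelled examples; in a real multimodal prompt these
# would be images rather than text descriptions.
FEW_SHOT_EXAMPLES = [
    ("A person covered in blood during a fight.", 8),
    ("A child playing with a puppy in a park.", 1),
]


def build_prompt(examples, new_description):
    """Build a few-shot prompt that ends where the model should answer."""
    lines = ["Rate the violence of each picture from 1 to 10."]
    for description, score in examples:
        lines.append(f"Image: {description}")
        lines.append(f"Violence score: {score} out of 10")
    lines.append(f"Image: {new_description}")
    lines.append("Violence score:")
    return "\n".join(lines)


def parse_score(reply):
    """Extract 'N out of 10' from a reply; return None if absent or out of range."""
    match = re.search(r"(\d+)\s*out of\s*10", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


if __name__ == "__main__":
    print(build_prompt(FEW_SHOT_EXAMPLES, "A girl smiling in front of a house fire."))
    print(parse_score("I will give this image a violence score of 4 out of 10."))
```

Parsing the score into a number is a design choice that makes it easy to apply a threshold afterwards, for example flagging anything above a chosen score for human review.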
Original Description
Explore multimodal language models like LLaVA, which let you reach GPT-4-level multimodal abilities and unlock use cases like chat with images.
🔗 Links
- Join my community: https://www.skool.com/ai-builder-club/about
- Follow me on twitter: https://twitter.com/jasonzhou1993
- Join my AI email list: https://www.ai-jason.com/
- My discord: https://discord.gg/eZXprSaCDE
- LLaVA link: https://llava-vl.github.io/
⏱️ Timestamps
0:00 Intro
1:03 What is multimodal?
1:23 LLaVA model
2:08 Demo
3:35 Use case: Product development
5:17 Use case: Content curation
6:27 Use case: Medical
7:07 Use case: Captcha
8:09 Use case: Robots
👋🏻 About Me
My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
#gpt #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #largelanguagemodels #largelanguagemodel #chatgpt #multimodality #gpt4 #multimodal #llama2 #llama #llava #machinelearning