“LLAMA2 supercharged with vision & hearing?!” | Multimodal 101 tutorial

AI Jason · Beginner ·🧠 Large Language Models ·2y ago

Key Takeaways

The video explores multimodal language models like LLaVA, which is integrated with Llama 2 and aims at GPT-4-level multimodal abilities, unlocking use cases like chat with images. It points to GPT-4 and Google's PaLM 2 as examples of recent multimodal results and covers the basics of multimodal AI, including joint embeddings and shared representations.

Full Transcript

Recently, large language models like OpenAI's GPT-4 and Google's PaLM 2 have shown incredible results integrating visual inputs and text to perform multimodal tasks. Yes, you heard me right: multimodal. This is kind of the next frontier of generative AI.
What that means is this: a large language model takes text input and turns it into vector embeddings so that it can understand the relationships between different words and use them to predict the next word in a sentence. Multimodal models, by contrast, can take more than text input: images, video, audio, or any type of data really. Behind the scenes, the model takes in those different types of data and creates a joint embedding, so that it has a shared representation space that captures information from text, images, video, and audio. Those shared representations enable it to solve problems and reason across different types of data.
For example, you can take a photo of your fridge and ask the model what kind of meals you can cook with all those leftovers. It will be able to understand what foods you actually have in the image and generate a recipe based on that information. You can also do some really advanced generation: if you give it an image of grass and also an audio clip, it will be able to generate or find an image that combines what is in the audio with the grass. During the GPT-4 demo, OpenAI also showcased the ability to turn a wireframe sketch like this into a functional HTML website, because the model can understand an image and extract the core information to complete various tasks.
So far, major large language models like GPT-4 haven't released any multimodal features yet, so most of us haven't had a chance to experience the power. But there's one multimodal model released recently called LLaVA, which stands for Large Language and Vision Assistant. It can run multimodal tasks across both image and text, it is integrated with Llama 2, and it is available for use right now. I tried it and it's very promising; it definitely gives us a taste of what the future looks like.
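The joint-embedding idea described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors, the file names, and the "barking" audio are invented stand-ins for what real modality-specific encoders would produce, and this is not how LLaVA itself is implemented. It only shows why a shared space lets an image and an audio clip be combined into one query.

```python
import math

# Hypothetical 3-d vectors standing in for encoder outputs that share one
# embedding space. Axes are made up: [dog-ness, grass-ness, beach-ness].
IMAGE_LIBRARY = {
    "dog_on_grass.jpg": [0.9, 0.8, 0.0],
    "empty_lawn.jpg": [0.0, 1.0, 0.0],
    "dog_on_beach.jpg": [0.9, 0.0, 0.9],
}
IMAGE_QUERY = [0.0, 1.0, 0.0]  # pretend embedding of a photo of grass
AUDIO_QUERY = [1.0, 0.0, 0.0]  # pretend embedding of a barking sound


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def combine(*vectors):
    """Add embeddings element-wise to form one cross-modal query."""
    return [sum(parts) for parts in zip(*vectors)]


def nearest(query, library):
    """Return the library key whose vector is most similar to the query."""
    return max(library, key=lambda name: cosine(query, library[name]))


if __name__ == "__main__":
    print(nearest(IMAGE_QUERY, IMAGE_LIBRARY))
    print(nearest(combine(IMAGE_QUERY, AUDIO_QUERY), IMAGE_LIBRARY))
```

Because both modalities live in the same space, adding the two query vectors moves the search toward items that contain both concepts, which is the behaviour the grass-plus-audio example relies on.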
So today I will give you a demo of how you can try it out, and then dive into a few real-world use cases that I think could be very interesting. You can go to GitHub and search for LLaVA (L-L-a-V-A). That's their public page, where you can install it and run it on your local machine, but there's also a demo link which will take you to this page that you can use right away.
For example, I can put this image in and ask the question, "What is in the photo, and what is the weather?" Click submit, and it returns that the photo features a Golden Retriever dog lying on grass and that the weather is sunny, because there is bright sunlight shining on the dog and the green grass. So you can tell it is doing more than just object detection in the photo; it actually tries to understand the photo and reason about it.
On the other hand, I can put another photo in and ask it to describe the photo to me. It says the photo shows a man sitting in a chair, wearing headphones and smoking a cigar. I can even ask a follow-up question like "Who is the man in the photo?" and it says it is Elon Musk. Again, this is not simple object detection; it actually tries to understand the photo and probably figures out the connection between this photo and other text data around it.
To push the boundary a little bit more, I will upload a pretty complex image like this one, where you actually need to read the image to understand what's going on. I'm going to ask it, "Please generate a story based on this image." All right, so it generates a story. It is able to understand that it is a four-panel dramatic scene, and the story is that a woman and a baby were caught in a dangerous situation, possibly a sudden flood or a strong current in the river, which is correct, and that the man jumped into the water and saved them, which is also correct. So this is pretty impressive. I think it missed the last part about the man getting a medal, but I would say it got 80% right. What's more impressive is that it is able to understand the facial expressions, so overall I'm very impressed with the performance.
And now let's dive into a few real-world use cases that I think could be very interesting. One use case I really want to try out is product development. I was so blown away when OpenAI made a demo that turned a sketch like this into a real website, and if this capability is possible, I think there will be a lot of interesting case studies about how designers or product managers can use it during early ideation. So I drew a similar type of sketch of the joke website, added the image in, and I'll use the same prompt that OpenAI used: "Write brief HTML to turn this mockup into a website." Let me try this. All right, so it does generate HTML. Let me copy it into a Visual Studio file. So this is the joke website it created. I would say it only gets about 40% of the requirements here. It's not as good as GPT-4, but it really does understand the structure and tries to recreate it, and maybe my sketch is not very good.
A more realistic use case, I think, would be giving the model an image of a mockup app design and letting it break down the requirements for me. So I'll drag and drop this mockup of the Uber app and give it the prompt, "You are a senior product manager. Please turn this mockup into a detailed product requirement doc." OK, so it is able to identify that it is a car-sharing app without me mentioning anything, which is a really good start, and it is able to break down the interface, the navigation, that the user needs to be able to select cars and also make a reservation, and unlocking. OK, that's a bit weird; it probably added a bit of extra stuff. So it is able to break down a product requirement doc from this. And think about whether we can use this to create requirements and give them to another type of large language model that is really good at programming, like Smol AI or GPT Engineer; that could be a really interesting combination.
The next use case is content curation. Although social media platforms spend a lot of resources on curating content, I want to test out whether
this model is good at curating and classifying content. For example, I can put in this image, which seems pretty violent, and give it the prompt, "Please give this image a violence score out of 10." It returns, "I will give this image a violence score of 8 out of 10," because the woman is covered in blood. This is probably pretty aligned with how I would rate it as well. Next I will change to another photo, which should be pretty non-violent. All right, so it says, "I will give this image a violence score of 1 out of 10." Let's try something more controversial. This image should be funny, but there are also elements like fire, so let's see whether it can tell. All right, we got the results: the violence score is 4 out of 10. There is fire, which is dangerous, but the child's expression suggests that she is not in immediate danger and might not be fully aware of the severity of the situation. This is a pretty good rating. So the initial testing results for content curation are really promising, and with prompting tactics like few-shot prompts it can probably do a really, really good job at content curation.
Another use case along the same line as image classification is medical and health diagnosis. For example, I can give it an image and ask, "What is wrong with my foot? What should I do?" It is able to give me a basic diagnosis, that it is a fungal infection. I don't know if it's correct or not, so please let me know if it is wrong. Then it is able to give some suggestions about how you can fix the issue. I can try another one: I can give it an image of a plant and ask, "Is there anything wrong with my plant? If so, how can I fix it?" and it is able to give me a few suggestions. With enough training data, I think this could be a really promising use case in medical and biotech.
While making this video, I started wondering whether CAPTCHA is going to become useless, because these image-text models can probably crack most CAPTCHA use cases easily. So let me grab an image like
this and then ask it to extract the text displayed in the image. OK, it kind of gets the characters in the middle, but it's definitely not going to pass. Let me try another one. All right, again it is wrong. So I guess the good news is that, out of the box, CAPTCHA is still going to work against this model. But the thing is, different research papers have shown that with few-shot prompts you can actually tune the model's performance for specific tasks, which means that with proper fine-tuning and few-shot prompts, this is probably something that can be cracked easily. And I already found that there are models like Microsoft's TrOCR that do an extremely good job at reading CAPTCHA letters. I can even try a more complex example like this, and it works perfectly; same for this one as well. So I do think CAPTCHA verification methods probably won't be effective very soon, and there will be a new type of verification process that we need to figure out.
The last use case I want to share, which is really inspiring, is from Google. In Google's PaLM-E research paper, they integrated a multimodal model with a robot. They give the robot a prompt, "Bring me the rice chips from the drawer," and the robot is able to generate a plan like a normal agent, based on both visual and text inputs, and then it starts executing the tasks. As the robot moves and gets new visual inputs, it also updates the plan. This allows the robot to complete some very complex tasks because it uses both visual and text inputs, and it really showcases what is possible with multimodal models.
So that is large multimodal models. As I mentioned, models like LLaVA can already be tried out today. You can read more details, go to the demo website to use it right away, or go to their GitHub link and install it on your local machine. I had a lot of fun playing with LLaVA, so I definitely encourage you to go and try it out, and please comment below about the interesting use cases you start exploring. If you like this video, please consider subscribing, and I'll see you next time.
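The robot use case in the transcript follows a plan, act, observe, re-plan loop. A minimal sketch of that loop is shown here under loud assumptions: `plan` is a hypothetical stub standing in for a multimodal model (it ignores the instruction text and reads a plain dictionary instead of a camera image), and `execute` is a toy world model. None of this reflects the actual robot system from the research paper.

```python
def plan(instruction, observation):
    """Hypothetical stub planner.

    A real system would send the instruction and the latest camera image to
    a multimodal model. Here the "image" is a dict and the instruction is
    accepted but not interpreted.
    """
    steps = []
    if not observation["drawer_open"]:
        steps.append("open drawer")
    if not observation["holding_chips"]:
        steps.append("pick up chips")
    steps.append("bring chips to user")
    return steps


def execute(step, observation):
    """Toy world model: apply one step and return the new observation."""
    new_observation = dict(observation)
    if step == "open drawer":
        new_observation["drawer_open"] = True
    elif step == "pick up chips":
        new_observation["holding_chips"] = True
    elif step == "bring chips to user":
        new_observation["delivered"] = True
    return new_observation


def run(instruction, observation, max_steps=10):
    """Re-plan from the newest observation before every single action."""
    history = []
    for _ in range(max_steps):
        if observation["delivered"]:
            break
        next_step = plan(instruction, observation)[0]
        observation = execute(next_step, observation)
        history.append(next_step)
    return history


if __name__ == "__main__":
    start = {"drawer_open": False, "holding_chips": False, "delivered": False}
    print(run("bring me the rice chips from the drawer", start))
```

Re-planning before every action is the design choice that matters: if the starting observation already shows an open drawer, the first step is skipped without any special-case code.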

Original Description

Explore Multimodal language model, like LLaVA, which enables you reach GPT4 level multimodal abilities, unlock use cases like chat with images
🔗 Links
- Join my community: https://www.skool.com/ai-builder-club/about
- Follow me on twitter: https://twitter.com/jasonzhou1993
- Join my AI email list: https://www.ai-jason.com/
- My discord: https://discord.gg/eZXprSaCDE
- LLaVA link: https://llava-vl.github.io/
⏱️ Timestamps
0:00 Intro
1:03 What is multimodal?
1:23 LLaVA model
2:08 Demo
3:35 Use case: Product development
5:17 Use case: Content curation
6:27 Use case: Medical
7:07 Use case: Captcha
8:09 Use case: Robots
👋🏻 About Me
My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
#gpt #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #largelanguagemodels #largelanguagemodel #chatgpt #multimodality #gpt4 #multimodal #llama2 #llama #llava #machinelearning

Playlist

Uploads from AI Jason · AI Jason · 16 of 60

1 Build Your Own Auto-GPT Apps without coding Step by Step (Dust.tt Tutorial)
2 AutoGPT tutorial: Build your personal assistant WITHOUT code (Via Relevance AI)
3 Create your own AI girlfriend that talks ❤️
4 How to build with Langchain 10x easier | ⛓️ LangFlow & Flowise
5 I build an autonomous researcher via GPT | Langchain ⛓️ Tutorial
6 Smol AI tutorial in 5 mins | Build ENTIRE codebase with a single prompt
7 Hugging Face + Langchain in 5 mins | Access 200k+ FREE AI models for your AI apps
8 How to let GPT control anything & 10x powerful | 8 mins tutorial about GPT funtion calling
9 Extract data & automate EVERYTHING | 10x GPT function calling power
10 Finally, an AI agent that actually works
11 "okay, but I want GPT to perform 10x for my specific use case" - Here is how
12 "Wait..this AI Agent does research for you 24hrs without hallucination?!" - Here is how
13 "How to give GPT my business knowledge?" - Knowledge embedding 101
14 “Automation 2.0 coming…No more boring data entry job”
15 "How to 10x chatbot UX? 🤖 🖼️ " - Add Image Responses to GPT knowledge retrieval apps
16 “LLAMA2 supercharged with vision & hearing?!” | Multimodal 101 tutorial
17 "Next Level Prompts?" - 10 mins into advanced prompting
18 Build AI agent workforce - Multi agent framework with MetaGPT & chatDev
19 How to scale your AI automation pipeline
20 AI agent manages community 24/7 - Build Agent workforce ep#1
21 Autogen - Microsoft's best AI Agent framework that is controllable?
22 StreamingLLM - Extend Llama2 to 4 million token & 22x faster inference?
23 AI agent + Vision = Incredible
24 After 7 days letting AI agents control my email inbox... 📮
25 How to use New OpenAI DevDay features - GPT4V x TTS demo tutorial
26 What is Q* | Reinforcement learning 101 & Hypothesis
27 "Research agent 3.0 - Build a group of AI researchers" - Here is how
28 GPT4V + Puppeteer = AI agent browse web like human? 🤖
29 Real Gemini demo? Rebuild with GPT4V + Whisper + TTS
30 AI Robot's ChatGPT moment at 2024?
31 GPT5 unlocks LLM System 2 Thinking?
32 The REAL cost of LLM (And How to reduce 78%+ of Cost)
33 OpenAI's Agent 2.0: Excited or Scared?
34 Real time AI Conversation Co-pilot on your phone, Crazy or Creepy?
35 INSANELY Fast AI Cold Call Agent- built w/ Groq
36 AI Employees Outperform Human Employees?! Build a real Sales Agent
37 Future of E-commerce?! Virtual clothing try-on agent
38 Unlock AI Agent real power?! Long term memory & Self improving
39 "I want Llama3 to perform 10x with my private knowledge" - Local Agentic RAG w/ llama3
40 “Wait, this Agent can Scrape ANYTHING?!” - Build universal web scraping agent
41 "Make Agent 10x cheaper, faster & better?" - LLM System Evaluation 101
42 Claude 3.5 struggle too?! The $Million dollar challenge
43 Make your agents 10x more reliable? Flow engineer 101
44 "I want Llama3.1 to perform 10x with my private knowledge" - Self learning Local Llama3.1 405B
45 AI process thousands of videos?! - SAM2 deep dive 101
46 "Wait, I'm using OpenAI Structured Output wrong ?!" - Advanced Structured Output tutorial
47 How to use Cursor AI build & deploy production app in 20 mins
48 Best Cursor Workflow that no one talks about...
49 This is how I scrape 99% websites via LLM
50 Better than Cursor? Future Agentic Coding available today
51 EASIEST Way to Train LLM Train w/ unsloth (2x faster with 70% less GPU memory required)
52 1000x Cursor workflow for building apps
53 Easiest way to build fancy UI with Cursor/Windsurf/Bolt/Lovable
54 From $0 to $4m with just 2 people (ComfyUI Crash-course for E-commerce)
55 Deepseek R1 - The Era of Reasoning models
56 Yep, o3-mini is WORTH the money - Build your own reasoning agent
57 The ONLY way to run your own Deepseek on mobile...
58 Those MCP totally 10x my Cursor workflow…
59 MCP = Next Big Opportunity? EASIST way to build your own MCP business
60 Gemini 2.0 blew me away - The future of Multimodal Model

This video tutorial introduces multimodal language models like LLaVA, which is integrated with Llama 2 and can take visual inputs together with text to perform tasks, aiming at GPT-4-level multimodal abilities. The tutorial covers the basics of multimodal AI, generative AI, and large language models, including joint embeddings and shared representations. By the end of the tutorial, viewers will have seen how to try LLaVA through its demo page and how it handles use cases like chat with images.

Key Takeaways
  1. Draw a sketch of a website
  2. Upload the sketch image to the model
  3. Generate HTML from the sketch
  4. Break down product requirements from a design mockup
  5. Curate and classify content
  6. Fine-tune multimodal models for specific tasks
  7. Use joint embeddings to capture information from text, image, video, and audio
💡 Multimodal language models like LLaVA can be used for a wide range of tasks, from creating HTML from sketches to suggesting possible medical conditions from images and text, and can be adapted to specific tasks with fine-tuning and few-shot prompts.
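Few-shot prompting, mentioned in the takeaway above, can be sketched for the violence-scoring test from the video. This is only an illustration under stated assumptions: the labeled examples are invented, the `<attached image>` placeholder stands in for however a given model accepts images, and the parser assumes replies phrased like the ones quoted in the transcript ("8 out of 10"). No real model API is called.

```python
import re

# Hypothetical labeled examples. A real few-shot prompt would pair each
# score with the actual image instead of a text description.
FEW_SHOT_EXAMPLES = [
    ("A violent fight scene", 8),
    ("A dog lying on grass in the sun", 1),
]


def build_prompt(examples):
    """Assemble a few-shot prompt that ends where the model should answer."""
    lines = ["Rate the violence in each image from 1 to 10."]
    for description, score in examples:
        lines.append(f"Image: {description}")
        lines.append(f"Violence score: {score} out of 10")
    lines.append("Image: <attached image>")
    lines.append("Violence score:")
    return "\n".join(lines)


def parse_score(reply):
    """Extract the integer from an 'N out of 10' reply, or None if absent."""
    match = re.search(r"(\d+)\s*out of\s*10", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


if __name__ == "__main__":
    print(build_prompt(FEW_SHOT_EXAMPLES))
    print(parse_score("I will give this image a violence score of 8 out of 10."))
```

Returning `None` for replies that do not match keeps an unparseable answer from being silently treated as a low score, which matters if the number feeds a moderation queue.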


Chapters (9)

0:00 Intro
1:03 What is multimodal?
1:23 LLaVA model
2:08 Demo
3:35 Use case: Product development
5:17 Use case: Content curation
6:27 Use case: Medical
7:07 Use case: Captcha
8:09 Use case: Robots