“LLAMA2 supercharged with vision & hearing?!” | Multimodal 101 tutorial

AI Jason · Beginner ·🧠 Large Language Models ·2y ago

Skills: Multimodal LLMs 90% · Fine-tuning LLMs 80% · Prompt Craft 70%

Key Takeaways

The video explores multimodal language models like LLaVA, which is integrated with Llama 2 and gives a taste of GPT-4-level multimodal abilities, unlocking use cases like chat with images. It references GPT-4 and Google PaLM 2 and covers the basics of multimodal AI, including joint embeddings and shared representations.

Full Transcript

Recently, large language models like OpenAI's GPT-4 and Google's PaLM 2 have shown incredible results integrating visual inputs and text to perform multimodal tasks. Yes, you heard me right: multimodal. This is kind of the next frontier of generative AI.

What that means is this: a large language model takes text input and turns it into vector embeddings, so that it can understand the relationships between different words and use them to predict the next word in a sentence. Multimodal models can take more than text inputs: images, video, audio, or any type of data, really. Behind the scenes, the model tokenizes the different types of data and somehow creates a joint embedding, so that it has a shared representation space that captures information from text, images, video, and audio. Those shared representations enable it to solve problems and reason across different types of data. For example, you can take a photo of your fridge and ask the model what kind of meals you can cook with all those leftovers. It will be able to understand what kind of food you actually have in the image and generate a recipe based on that information. You can also do some really advanced generation: if you give it an image of grass and also an audio clip, it will be able to generate or find an image with both dogs and grass as elements. During the GPT-4 demo, OpenAI also showcased the ability to turn a wireframe sketch like this into a functional HTML website, because the model can understand an image and extract the core information to complete various tasks.

So far, major large language models like GPT-4 haven't released any multimodal features yet, so most of us haven't had a chance to experience the power. But there's one multimodal model released recently called LLaVA, which stands for Large Language and Vision Assistant. It has the ability to run multimodal tasks across both image and text. It is integrated with Llama 2 and it is available for use right now. I tried it and it's very promising. It definitely gives us a taste of what the future looks like.
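To make the joint-embedding idea from the transcript concrete, here is a minimal sketch in Python. It assumes every input, whatever its modality, has already been mapped to a vector in one shared space. The three-dimensional vectors, the file names, and the `nearest` and `combine` helpers are made up for illustration; they are not output from LLaVA or any real encoder. Under that assumption, cross-modal lookup reduces to a nearest-neighbor search by cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings: (modality, name) -> vector in the shared space.
# Real encoders produce hundreds of dimensions; these values are invented.
SHARED_SPACE = {
    ("image", "dog_on_grass.jpg"): [0.90, 0.80, 0.10],
    ("image", "grass.jpg"): [0.10, 0.90, 0.10],
    ("image", "city_street.jpg"): [0.10, 0.00, 0.90],
    ("audio", "dog_bark.wav"): [0.95, 0.20, 0.05],
    ("text", "a dog lying on grass"): [0.85, 0.75, 0.15],
}

def nearest(query_vec, modality):
    """Return the name of the stored item of `modality` closest to the query."""
    candidates = [
        (key, vec) for key, vec in SHARED_SPACE.items() if key[0] == modality
    ]
    best_key, _ = max(candidates, key=lambda kv: cosine(query_vec, kv[1]))
    return best_key[1]

def combine(*vecs):
    """Average several embeddings into one query vector."""
    return [sum(component) / len(vecs) for component in zip(*vecs)]

# Text query -> closest image.
text_query = SHARED_SPACE[("text", "a dog lying on grass")]
print(nearest(text_query, "image"))

# Grass image + dog audio -> closest image containing both concepts.
mixed_query = combine(
    SHARED_SPACE[("image", "grass.jpg")],
    SHARED_SPACE[("audio", "dog_bark.wav")],
)
print(nearest(mixed_query, "image"))
```

Averaging is only one simple way to combine queries from two modalities; it is used here because it keeps the example short, not because the video describes it.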
So today I will give you a demo of how you can try it out, as well as dive into a few real-world use cases that I think could be very interesting. You can go to GitHub and search for LLaVA (L-L-a-V-A). That's their public page, where you can install it and run it on your local machine. But there's also a demo link, which will take you to this page that you can use right away.

For example, I can put this image in and then ask the question, "What is in the photo and what is the weather?" Click submit, and it returns that the photo features a golden retriever dog lying on grass, and the weather is sunny because there is bright sunlight shining on the dog and the green grass. So you can tell it is doing more than just object detection in the photo. It actually tries to understand the photo and does the reasoning here.

On the other side, I can put another photo in and then ask it to describe the photo to me. It says the photo shows a man sitting in a chair, wearing headphones and smoking a cigar. And I can even ask a follow-up question like "Who is the man in the photo?" It says it is Elon Musk. So again, this is not simple object detection. It actually tries to understand the photo and probably figures out the connection between this photo and other types of text data around it.

To push the boundary a little bit more, I will upload a pretty complex image like this, where you actually need to read the image to understand what's going on. So I'll ask it, "Please generate a story based on this image." All right, so it generates a story. It is able to understand that it is a four-panel dramatic scene, and the story is that a woman and baby were caught in a dangerous situation, possibly a sudden flood or a strong current in the river, which is correct, and that the man jumped into the water and saved them, which is also correct. So this is pretty impressive. I think it missed the last part about the man getting a medal, but I would say it got 80% right. What's more impressive is that it is able to understand the facial expressions. So overall I'm very impressed with the performance.
Now let's dive into a few real-world use cases that I think could be very interesting. One use case I really want to try out is product development. I was so blown away when OpenAI made a demo that turned a sketch like this into a real website. If this capability is possible, I think there will be a lot of interesting case studies about how designers or product managers can use this during early ideation. So I drew a similar type of sketch of the joke website and added the image in, and I'll use the same prompt that OpenAI used: "Write brief HTML to turn this mockup into a website." Let me try this. All right, so it does generate HTML. Let me copy it into a Visual Studio file. So this is the joke website it created. I would say it only gets about 40% of the requirements here. It's not as good as GPT-4, but it does really understand the structure and tries to recreate it. And maybe my sketch is not very good.

A more realistic use case, I think, would be giving it an image of a mockup app design and then letting it break down the requirements for me. So I drag and drop this mockup of the Uber app and then give it the prompt, "You are a senior product manager. Please turn this mockup into a detailed product requirement doc." OK, so it is able to identify that this is a car-sharing app without me mentioning anything. That's a really good start. It is able to break down the interface and navigation: the user needs to be able to select cars and also make a reservation, and unlocking. OK, that's a bit weird; it probably added a bit of extra stuff. So it is able to break down a product requirement doc from this. And think about it: if we can use this to create requirements and give them to other large language models that are really good at programming, like Smol AI or GPT Engineer, then this could be a really interesting combination.

The next use case is content curation. Social media platforms spend a lot of resources on curating content, so I want to test out whether this is good at curating and classifying content. For example, I can put in this image, which seems pretty violent, and give it the prompt, "Please give this image a violence score out of 10." It returns, "I will give this image a violence score of 8 out of 10, because this woman is holding it down and covered in blood." This is probably pretty aligned with how I would rate it as well. Next I will change to another photo, which should be pretty non-violent. All right, so it says, "I will give this image a violence score of 1 out of 10." Let's try something more controversial. This image should be funny, but there are also elements like fire, so let's see whether it can tell. All right, so we got the result: it is a violence score of 4 out of 10. There is fire, which is dangerous, but the child's mouth suggests that she is not in immediate danger and might not be fully aware of the severity of the situation. This is a pretty good rating. So the initial testing results for content curation are really promising, and with prompting tactics like few-shot prompts, it can probably do a really, really good job at content curation.

Another use case, along the same lines as image classification, is medical and health diagnosis. For example, I can give it an image and ask, "What is wrong with my foot? What should I do?" It is able to give me a basic diagnosis that it is a fungal infection. I don't know if it's correct or not, so please let me know if it is wrong. And then it is able to give some suggestions about how you can fix those issues. I can try another one: I can give it an image of a plant and ask, "Is there anything wrong with my plant? If so, how can I fix it?" And it is able to give me a few suggestions. With enough training data, I think this could be a really promising use case in medical and biotech.

While making this video, I started thinking: is CAPTCHA going to be useless? Because those image-text models can probably crack most CAPTCHA use cases easily. So let me grab an image like this and then ask it to extract the text displayed in the image. OK, it kind of gets the things in the middle, but it's definitely not going to pass. Let me try another one. All right, again it is wrong. So I guess the good news is that, out of the box, CAPTCHA is still going to work against this model. But the thing is, different research papers have shown that with few-shot prompts you can actually tune the performance of the model for specific tasks, which means that with proper fine-tuning and few-shot prompts, this is probably something that can be cracked easily. And I already found that there are models like Microsoft's TrOCR that do an extremely good job of recognizing CAPTCHA letters. I can even try a more complex example like this, and it works perfectly. Same for this one as well. So I do think that CAPTCHA verification methods probably won't be effective very soon, and there will be a new type of verification process that we need to figure out.

The last use case I want to share, which is really inspiring, is from Google. In Google's PaLM-E research paper, they integrated a multimodal model with a robot. They give the robot a prompt, "Bring me the rice chips from the drawer," and then the robot is able to generate a plan like a normal agent, based on both visual and text inputs, and then it starts executing the tasks. As the robot moves and gets new visual inputs, it also updates the plan. This allows the robot to complete some very complex tasks because of both the visual and text inputs, and this really showcases what is possible with multimodal models.

So this is large multimodal models. As I mentioned, models like LLaVA can already be tried out today. You can read more details, go to the demo website to use it right away, or go to their GitHub link and try to install it on your local machine. I had a lot of fun playing with LLaVA, so I definitely encourage you to go and try it out, and please comment below about interesting use cases you start exploring. If you like this video, please consider giving me a subscribe, and I'll see you next time.
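The content-curation test in the transcript asks the model for a violence score out of 10 and suggests few-shot prompts as a way to get better ratings. As a rough sketch of that idea, the Python below builds a few-shot text prompt and parses a score out of a reply. The example descriptions, the `build_prompt` and `parse_score` helpers, and the reply format are all assumptions for illustration; the video does not show a prompt template, and sending the prompt and image to an actual model is left out.

```python
import re

# Hypothetical labeled examples that would precede the real question.
FEW_SHOT_EXAMPLES = [
    ("Two people fighting with weapons", 9),
    ("A dog lying on grass in sunlight", 1),
]

def build_prompt(examples):
    """Assemble a few-shot prompt that ends where the model should answer."""
    lines = ["Rate how violent each image is on a scale of 0 to 10.", ""]
    for description, score in examples:
        lines.append(f"Image: {description}")
        lines.append(f"Answer: {score} out of 10")
        lines.append("")
    lines.append("Image: <attached image>")
    lines.append("Answer:")
    return "\n".join(lines)

def parse_score(reply):
    """Extract an integer 0-10 from text like '8 out of 10' or '8/10'.

    Returns None when no valid score is found.
    """
    match = re.search(r"\b(\d{1,2})\s*(?:out of|/)\s*10\b", reply, re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 10 else None

prompt = build_prompt(FEW_SHOT_EXAMPLES)
print(prompt)

# A reply in the style quoted in the transcript.
reply = "I will give this image a violence score of 8 out of 10."
print(parse_score(reply))
```

Parsing the score into a number is what would let a curation pipeline apply a threshold; a reply with no recognizable score returns `None` so it can be flagged for human review instead of being guessed.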

Original Description

Explore Multimodal language model, like LLaVA, which enables you reach GPT4 level multimodal abilities, unlock use cases like chat with images

🔗 Links
- Join my community: https://www.skool.com/ai-builder-club/about
- Follow me on twitter: https://twitter.com/jasonzhou1993
- Join my AI email list: https://www.ai-jason.com/
- My discord: https://discord.gg/eZXprSaCDE
- LLaVA link: https://llava-vl.github.io/

⏱️ Timestamps
0:00 Intro
1:03 What is multimodal?
1:23 LLaVA model
2:08 Demo
3:35 Use case: Product development
5:17 Use case: Content curation
6:27 Use case: Medical
7:07 Use case: Captcha
8:09 Use case: Robots

👋🏻 About Me
My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com

#gpt #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #largelanguagemodels #largelanguagemodel #chatgpt #multimodality #gpt4 #multimodal #llama2 #llama #llava #machinelearning

This video tutorial introduces multimodal language models like LLaVA, which take visual inputs along with text, and shows how viewers can try GPT-4-style multimodal abilities today. It covers the basics of multimodal AI, including joint embeddings and shared representations, then walks through use cases such as chat with images, turning a sketch into HTML, and content curation.

Key Takeaways

Draw a sketch of a website
Upload the sketch image to the model
Create HTML from the sketch
Break down product requirements from a design
Curate and classify content
Fine-tune multimodal models for specific tasks
Use joint embeddings to capture information from text, image, video, and audio

💡 Multimodal language models like LLaVA can be tried on a wide range of tasks, from creating HTML from sketches to suggesting a basic diagnosis from an image and a question, and their performance on specific tasks can likely be improved with fine-tuning and few-shot prompts.

Chapters (9)

0:00 Intro

1:03 What is multimodal?

1:23 LLaVA model

2:08 Demo

3:35 Use case: Product development

5:17 Use case: Content curation

6:27 Use case: Medical

7:07 Use case: Captcha

8:09 Use case: Robots
