Multimodal LLMs

Work with vision-language models, audio LLMs, and multimodal pipelines.

After this skill you can…

  • Use GPT-4V / Claude Vision for image understanding
  • Build document OCR pipelines
  • Chain audio → text → action workflows

Prerequisites

Watch (10 videos)

Building Multimodal Search and RAG
Coursera · intermediate hands-on
→ Build a multimodal search system with LLMs
→ Implement RAG with multimedia data
Large Multimodal Model Prompting with Gemini
Coursera · beginner hands-on
→ Build applications with Gemini
→ Unify text, images, and videos with multimodal models
Multimodal Requirements Development
Daniel Finkenstadt · advanced hands-on
→ Use GPT-4 for multimodal interactions
→ Derive technical requirements from oral problem statements
Gemini 3: Code a visualization of nuclear fusion
Google DeepMind · intermediate hands-on
→ Generate multimodal content
→ Code a complex visual simulation
AI Generated Video Game is NOT SCI-FI Anymore!!!
1littlecoder · advanced hands-on
→ Generate interactive 3D worlds with AI
→ Create procedural content for games
Google Veo 3 Tutorial: How to create AI Videos in Flow, Gemini or Google Vids?
AI Tool Journey · beginner hands-on
→ Generate AI videos with Google Veo 3
→ Use Veo 3 in Gemini, Flow, and Google Vids
Ollama Multimodal: EASILY setup Llava locally & Integrate API
Mervin Praison · intermediate hands-on
→ Set up Ollama multimodal with LLaVA
→ Integrate the multimodal AI API
Create Your First AI Video
Coursera · beginner hands-on
→ Create AI videos from text prompts
→ Use Veo 3 for video generation
JETSON AI LAB | One-Shot Multimodal RAG on Jetson Orin
NVIDIA Developer · beginner hands-on
→ Perform one-shot classification and recognition with multimodal RAG
→ Tag images in a vector DB at runtime
RIP KLING AI! FREE NSFW 120s IMAGE TO VIDEO KING on 6 GB VRAM!
Aitrepreneur · beginner hands-on
→ Generate videos from images using FramePack
→ Use the web UI for video creation