Build Multimodal Generative AI Applications
Ready to level up your GenAI skills? Step into the exciting world of multimodal AI, where language, images, and speech come together to build smarter, more interactive applications.
In this hands-on course, you’ll learn how to build systems that work across multiple modalities, from creating AI-powered storytellers and meeting assistants to developing image captioning tools and video generation apps.
You’ll gain experience with real-world tools like IBM’s Granite, OpenAI’s Whisper, Sora, and DALL·E, Meta’s Llama, Mistral’s Mixtral, and Gradio. Plus, you’ll explore multimodal search, question answering, and retrieval systems that combine text, speech, and visual data.
By the end of the course, you’ll be able to design and build full-stack multimodal AI solutions using Python and frameworks like Flask and Gradio.
If you’re looking to gain in-demand skills for building the next generation of AI applications, enroll today and power up your AI career!