Introducing NVIDIA Nemotron 3 Nano Omni

NVIDIA Developer · Advanced · Large Language Models
NVIDIA Nemotron 3 Nano Omni is a fully open hybrid Mixture-of-Experts model, with 30B total parameters and roughly 3B active per token, that unifies video, audio, image, and text reasoning in a single architecture built for agentic AI. It supplies the missing multimodal perception layer, handling everything an agent needs to see and hear without the complexity and overhead of stitching together separate vision, speech, and language models.

🛠️ Key Technical Highlights:

• Hybrid MoE Architecture: Mamba layers for sequence efficiency combined with Transformer layers for precision reasoning deliver up to 4x memory and compute efficiency in a sub-agent role.
• Native Video Reasoning: 3D convolutional layers handle temporal-spatial video data efficiently, and Efficient Video Sampling (EVS) processes longer videos in the same amount of time; together they lower inference cost and deliver 9.2x higher system efficiency (tok/s) for video use cases.
• Audio and Speech: A Parakeet encoder trained on the Granary ASR datasets handles transcription, spoken QA, and music reasoning in the same context window as text.
• High-Resolution Vision: The CRADIOv4 encoder powers OCR, document parsing, and screen reading for computer-use agents, with 7.5x higher system efficiency (tok/s) for multi-document use cases.
• Hardware-Aware Inference: FP8, NVFP4, and BF16 quantization on Hopper and Blackwell GPUs, with a 256K-token context length.

Open by Design: Weights, 138 billion multimodal pretraining tokens, 268 million post-training samples, and end-to-end recipes are all publicly released.

NVIDIA Technical Blog: https://nvda.ws/4u8Mzzl
NVIDIA Tech Report: https://nvda.ws/4cCklqT
Nemotron 3 Nano Omni on Hugging Face: https://nvda.ws/420h6mR
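The "30B total, ~3B active" efficiency comes from MoE routing: each token only runs through a few selected expert MLPs. The post does not describe Nemotron's actual router, so the following is a generic toy sketch of top-k softmax routing (all names and the tiny experts are hypothetical), just to illustrate why most parameters stay idle per token:

```python
import numpy as np

def top_k_moe(x: np.ndarray, gate_w: np.ndarray, experts: list, k: int = 2) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs with
    softmax weights renormalized over the selected experts. Only k of
    len(experts) expert networks run per token — the source of MoE's
    active-parameter savings."""
    logits = x @ gate_w                          # (n_tokens, n_experts) gate scores
    top = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        sel = logits[i, top[i]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                             # softmax over the top-k only
        for weight, idx in zip(w, top[i]):
            out[i] += weight * experts[idx](token)
    return out

# 4 toy "experts": each just scales its input by a different constant.
experts = [lambda t, s=s: s * t for s in (1.0, 2.0, 3.0, 4.0)]
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))          # 5 tokens, 8-dim embeddings
gate_w = rng.normal(size=(8, 4))     # learned gate projection (random here)
y = top_k_moe(x, gate_w, experts, k=2)
```

With k=2 of 4 experts, half the expert parameters are untouched for any given token; scaled up, this is how a 30B-parameter model can compute like a ~3B one.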
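Efficient Video Sampling works because consecutive video frames are highly redundant. The actual EVS algorithm is not described in this post; the sketch below is a hypothetical stand-in that drops patch tokens whose embedding barely changed since the previous frame, to show the kind of token reduction that lowers video inference cost:

```python
import numpy as np

def prune_video_tokens(frames: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """frames: (T, P, D) patch embeddings for T frames of P patches.
    Keep every token of the first frame; for later frames, keep only
    tokens whose embedding moved more than `threshold` (L2 distance)
    since the previous frame. Returns a boolean keep-mask (T, P)."""
    t, p, d = frames.shape
    keep = np.ones((t, p), dtype=bool)
    diffs = np.linalg.norm(frames[1:] - frames[:-1], axis=-1)  # (T-1, P)
    keep[1:] = diffs > threshold
    return keep

# Toy video: 4 frames x 3 patches x 2-dim embeddings; patch 0 is static.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 3, 2))
video[:, 0, :] = video[0, 0, :]      # patch 0 never changes across frames
mask = prune_video_tokens(video, threshold=0.1)
print(mask.sum(), "of", mask.size, "tokens kept")
```

The static patch is encoded once and skipped thereafter, so the sequence the language model sees grows much more slowly than the raw frame count.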
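The FP8 support mentioned above refers to hardware number formats on Hopper/Blackwell. As a rough intuition for what per-tensor FP8 (e4m3) quantization does to weights, here is a pure-NumPy simulation (not NVIDIA's implementation) that scales a tensor into the e4m3 range and snaps values to a 3-bit mantissa grid:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_dequantize_fp8(x: np.ndarray):
    """Simulated FP8 round trip: per-tensor symmetric scaling into the
    e4m3 range, then snapping each value to a 3-bit mantissa grid
    (m * 2**e with 8 <= |m| < 16). A simulation only — real FP8 also
    has limited exponent range and special values."""
    scale = FP8_E4M3_MAX / max(np.abs(x).max(), 1e-12)
    xs = x * scale
    e = np.floor(np.log2(np.maximum(np.abs(xs), 1e-12))) - 3
    q = np.round(xs / 2.0**e) * 2.0**e      # nearest point on the mantissa grid
    return q / scale, scale

x = np.array([0.003, -1.25, 7.0, 0.0])
approx, scale = quantize_dequantize_fp8(x)
```

With 3 mantissa bits the relative round-trip error stays below 2**-4 ≈ 6.25%, which is why FP8 inference can roughly halve memory versus BF16 while keeping outputs close.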
