On-Device LLM Inference on Android With ExecuTorch and Qualcomm QNN - Shivay Lamba & Kartikey Rawat, Qualcomm
Multimodal models like CLIP are typically deployed in the cloud due to their size and computational demands, limiting their use in latency-sensitive, privacy-preserving, and offline-first applications. This talk demonstrates how one can run fully on-device CLIP inference on Android using ExecuTorch with the Qualcomm QNN backend, enabling real-time vision–language understanding without server dependency.
One can run models like CLIP (ViT-B/32) entirely on edge devices, leveraging QNN for hardware-accelerated inference. A key focus of the talk is a deep dive into ExecuTorch optimizations for QNN, including graph lowering, operator fusion, quantization strategies, memory planning, and backend-specific execution choices that materially impact latency, memory footprint, and power consumption.
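As a rough illustration of the flow the talk walks through, the sketch below captures a PyTorch model with torch.export, lowers it to the ExecuTorch edge dialect, and delegates QNN-supported subgraphs to the Qualcomm partitioner. The Qualcomm import paths and the generate_qnn_executorch_compiler_spec helper are assumptions about the ExecuTorch QNN backend API (names vary across releases), and a toy module stands in for the actual CLIP image tower.

```python
# Minimal sketch of the ExecuTorch export-and-lower flow for a vision encoder.
# The Qualcomm backend imports and the compiler-spec helper are assumptions
# about the ExecuTorch QNN backend API and may differ across versions.
import torch
from torch.export import export
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.qualcomm.utils.utils import (
    generate_qnn_executorch_compiler_spec,  # assumed helper; takes SoC/HTP options
)

# Stand-in for the CLIP ViT-B/32 image tower; in practice, load the real
# encoder (e.g. from open_clip) and call .eval() before export.
class TinyEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=32, stride=32)
        self.proj = torch.nn.Linear(8 * 7 * 7, 512)

    def forward(self, x):
        return self.proj(self.conv(x).flatten(1))

model = TinyEncoder().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# 1. Capture a full graph ahead of time with torch.export
#    (no Python interpreter is needed at runtime).
exported = export(model, example_inputs)

# 2. Lower to the edge dialect and delegate QNN-supported subgraphs to the
#    Qualcomm partitioner; graph lowering and operator fusion happen here.
#    (PT2E quantization, if used, is applied to the captured graph first.)
compile_spec = generate_qnn_executorch_compiler_spec(...)  # placeholder args: SoC model, HTP options
edge = to_edge_transform_and_lower(
    exported, partitioner=[QnnPartitioner(compile_spec)]
)

# 3. Serialize a .pte program that the on-device ExecuTorch runtime can load.
with open("clip_image_encoder_qnn.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)
```

Delegation is subgraph-level: operators the QNN backend cannot handle fall back to ExecuTorch's portable CPU kernels rather than failing the export, which is one of the backend-specific execution choices the talk examines.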
The talk will cover architectural insights, model export and compilation workflows, and real-world benchmarks of latency, memory usage, and power efficiency. It highlights how large multimodal PyTorch models can be made production-ready on edge devices, unlocking new classes of private, offline-capable AI applications.
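For a feel of how such latency numbers might be gathered, the sketch below loads the serialized program with the ExecuTorch Python runtime bindings and times repeated forward passes. The executorch.runtime module and its Runtime.get()/load_program/load_method API are assumptions based on recent ExecuTorch releases; real on-device figures come from the Android/C++ runtime with the QNN delegate on a Snapdragon device, not from a host-side loop like this.

```python
# Hedged sketch: load the exported .pte and time forward passes with the
# ExecuTorch Python runtime bindings (API assumed from recent releases).
import time
import torch
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("clip_image_encoder_qnn.pte")
method = program.load_method("forward")

x = torch.randn(1, 3, 224, 224)
method.execute([x])  # warm-up run before timing

# Crude average-latency measurement over repeated runs.
n = 50
start = time.perf_counter()
for _ in range(n):
    method.execute([x])
elapsed = (time.perf_counter() - start) / n
print(f"avg forward latency: {elapsed * 1000:.2f} ms")
```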