Lightning Talk: Accelerating On-Device ML Inference With ExecuTorch and Arm SME2 - Jason Zhu, Arm
Skills:
AI Workflow Automation80%
Lightning Talk: Accelerating On-Device ML Inference With ExecuTorch and Arm SME2 - Jason Zhu, Arm
As on-device AI workloads grow in complexity, achieving low-latency inference within mobile power constraints remains a central challenge. We examine how ExecuTorch, combined with Arm’s Scalable Matrix Extension 2 (SME2), enables efficient CPU deployments of production AI workloads. We present a case study of SqueezeSAM, a segmentation model deployed in real-world mobile applications. Using ExecuTorch with XNNPACK delegation and SME2-optimized kernels, we evaluate INT8 and FP16 inference on a flagship smartphone. Moving beyond aggregate latency, we apply operator-level profiling to decompose runtime across convolution, GEMM, elementwise, and data movement operators, showing how hardware acceleration reshapes bottlenecks in the execution stack. SME2 delivers up to 3.9x end-to-end speedup on a single CPU core, materially altering runtime composition and revealing data movement as the primary post-acceleration bottleneck. This session presents a practical workflow for deploying, profiling, and systematically optimizing on-device PyTorch models, demonstrating how SME2 expands the viable design space for interactive mobile AI.
Watch on YouTube ↗
(saves to browser)
Sign in to unlock AI tutor explanation · ⚡30
More on: AI Workflow Automation
View skill →Related AI Lessons
⚡
⚡
⚡
⚡
Mental Algorithms: How AI Changes the Cost of Thinking
Dev.to AI
The AI Content System I Built to Generate Viral LinkedIn Posts Started Bringing Clients…
Medium · Programming
$5,000/Month AI Income: Local Business Review Translation Service
Medium · ChatGPT
Gmail's New AI Features Are Live—And They're About to Change What You Actually See
Medium · Programming
🎓
Tutor Explanation
DeepCamp AI