Lightning Talk: Accelerating On-Device ML Inference With ExecuTorch and Arm SME2 - Jason Zhu, Arm

Uploaded: 2026-04-20T20:21:45Z
Channel: PyTorch



As on-device AI workloads grow in complexity, achieving low-latency inference within mobile power constraints remains a central challenge. We examine how ExecuTorch, combined with Arm’s Scalable Matrix Extension 2 (SME2), enables efficient CPU deployments of production AI workloads.

We present a case study of SqueezeSAM, a segmentation model deployed in real-world mobile applications. Using ExecuTorch with XNNPACK delegation and SME2-optimized kernels, we evaluate INT8 and FP16 inference on a flagship smartphone. Moving beyond aggregate latency, we apply operator-level profiling to decompose runtime across convolution, GEMM, elementwise, and data movement operators, showing how hardware acceleration reshapes bottlenecks in the execution stack.

SME2 delivers up to 3.9x end-to-end speedup on a single CPU core, materially altering runtime composition and revealing data movement as the primary post-acceleration bottleneck. This session presents a practical workflow for deploying, profiling, and systematically optimizing on-device PyTorch models, demonstrating how SME2 expands the viable design space for interactive mobile AI.
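The operator-level decomposition described in the abstract can be illustrated with a minimal sketch. It is not the speaker's tooling: the keyword-to-category mapping, the helper names (`categorize`, `runtime_share`, `accelerate`), the operator names, and every timing below are hypothetical, chosen only to show how speeding up convolution and GEMM can leave data movement as the largest share of the remaining runtime.

```python
from collections import defaultdict

# Hypothetical keyword-to-category mapping (an assumption for illustration);
# real operator names depend on the profiler output being analysed.
CATEGORY_KEYWORDS = {
    "convolution": ("conv",),
    "gemm": ("linear", "matmul", "addmm", "bmm"),
    "data_movement": ("permute", "transpose", "copy", "cat", "slice", "view"),
}


def categorize(op_name):
    """Map an operator name to one of the four categories named in the abstract."""
    name = op_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return "elementwise"  # crude fallback bucket for everything else


def runtime_share(timings_ms):
    """Return each category's fraction of the total runtime."""
    totals = defaultdict(float)
    for op_name, ms in timings_ms:
        totals[categorize(op_name)] += ms
    total = sum(totals.values())
    return {category: ms / total for category, ms in totals.items()}


def accelerate(timings_ms, speedups):
    """Divide each operator's time by its category's speedup factor (default 1.0)."""
    return [(op, ms / speedups.get(categorize(op), 1.0)) for op, ms in timings_ms]


# Made-up timings in milliseconds, for illustration only
# (these are NOT measurements from the talk).
baseline = [
    ("aten.convolution.default", 50.0),
    ("aten.linear.default", 20.0),
    ("aten.add.Tensor", 10.0),
    ("aten.permute_copy.default", 20.0),
]

# Assume a hypothetical 5x kernel speedup on convolution and GEMM only.
accelerated = accelerate(baseline, {"convolution": 5.0, "gemm": 5.0})

before = runtime_share(baseline)
after = runtime_share(accelerated)
for category in before:
    print(f"{category:14s} {before[category]:6.1%} -> {after[category]:6.1%}")
```

With these invented numbers, the data movement operators take the same absolute time before and after, but their share of the total grows once the matrix-heavy operators shrink. This is the kind of shift in runtime composition the abstract describes.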
