On-Device LLM Inference via KMP and llama.cpp

📰 Dev.to · SoftwareDevs mvpfactory.io

Build a KMP shared module that wraps llama.cpp through cinterop (iOS) and JNI (Android). The article covers:

- mmap-based model loading to avoid OOM kills
- hardware accelerator delegation (Apple Neural Engine via Core ML; Android NNAPI/GPU delegates)
- quantization format tradeoffs (Q4_K_M vs. Q5_K_S under mobile DRAM constraints)
- thermal throttling detection with adaptive token generation rates
- structured output parsing for app-integrated AI features

It closes with real profiling data comparing on-device latency, memory pressure, and battery drain across a Pixel 8 and an iPhone 15 Pro.
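To make the Q4_K_M vs. Q5_K_S tradeoff concrete, here is a minimal sketch of a size estimate from parameter count and bits-per-weight. The bits-per-weight figures are approximate averages for llama.cpp K-quants (my assumption, not from the article); real GGUF files add metadata and mix quantization types across layers.

```kotlin
// Rough GGUF model-size estimate. The bpw values are assumed
// approximations for llama.cpp K-quants, not exact file sizes.
enum class Quant(val bitsPerWeight: Double) {
    Q4_K_M(4.85),
    Q5_K_S(5.5),
}

fun estimateModelBytes(paramCount: Long, quant: Quant): Long =
    (paramCount * quant.bitsPerWeight / 8).toLong()

fun main() {
    val params = 7_000_000_000L // a 7B-class model
    for (q in Quant.entries) {
        val gib = estimateModelBytes(params, q) / (1024.0 * 1024 * 1024)
        println("%s ~ %.1f GiB".format(q, gib))
    }
}
```

On a 7B model the gap works out to roughly half a GiB of resident weights, which is significant against the ~8 GB DRAM budgets of the phones profiled in the article.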
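One way the "thermal throttling detection with adaptive token generation rates" idea could look in shared Kotlin code is a pure policy function mapping a platform thermal level to a decode-rate cap. The level scale mirrors Android's `PowerManager` thermal status constants (0 = NONE through 6 = SHUTDOWN); the iOS side of the module would map `ProcessInfo.thermalState` onto the same scale. The policy numbers are illustrative assumptions, not the article's values.

```kotlin
// Illustrative adaptive-decoding policy (assumed thresholds).
// `status`: 0=NONE, 1=LIGHT, 2=MODERATE, 3=SEVERE, 4+=CRITICAL/EMERGENCY.
fun maxTokensPerSecond(status: Int, baseline: Double = 20.0): Double = when {
    status <= 1 -> baseline       // NONE/LIGHT: run at full decode rate
    status == 2 -> baseline * 0.5 // MODERATE: halve the token rate
    status == 3 -> baseline * 0.25 // SEVERE: keep the UI alive, crawl
    else -> 0.0                   // CRITICAL and above: pause generation
}
```

Keeping the policy a pure function in `commonMain` means both the JNI and cinterop callers can share it, and it is trivially unit-testable without device hardware.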
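For "structured output parsing", a common first step is pulling a balanced JSON object out of chatty model text before handing it to a real parser such as kotlinx.serialization. The helper below is a naive sketch of that step (my own, not the article's parser), and deliberately ignores braces inside string literals.

```kotlin
// Naive extraction of the first balanced {...} span from LLM output.
// Illustrative only: does not handle braces inside JSON string values;
// production code would constrain decoding with a grammar instead.
fun extractJsonObject(text: String): String? {
    val start = text.indexOf('{')
    if (start < 0) return null
    var depth = 0
    for (i in start until text.length) {
        when (text[i]) {
            '{' -> depth++
            '}' -> {
                depth--
                if (depth == 0) return text.substring(start, i + 1)
            }
        }
    }
    return null // unbalanced: the model was cut off mid-object
}
```

Returning `null` on an unbalanced span lets the caller retry generation rather than feed truncated JSON into the app layer.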

Published 2 Apr 2026