Accelerating Vision Transformers with Adaptive Patch Sizes

📰 ArXiv cs.AI

arXiv:2510.18091v2. Abstract: Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, which produces long input sequences for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patches to more homogeneous areas and smaller patches to more com
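
To make the idea of content-dependent patch sizes concrete, here is a minimal sketch and not the authors' implementation. The abstract excerpt does not say how homogeneity is measured or how patches are selected, so everything below is an assumption made for illustration: the function name `adaptive_patches`, the use of pixel variance as the homogeneity score, the quadtree-style splitting, and the `base`, `max_size`, and `threshold` parameters.

```python
from statistics import pvariance


def adaptive_patches(image, base=4, max_size=16, threshold=1e-3):
    """Return (row, col, size) patches that tile a grayscale image.

    Assumptions (not from the paper): `image` is a 2D list of floats whose
    side lengths are multiples of `max_size`, `max_size` is `base` times a
    power of two, and "homogeneous" means pixel variance <= `threshold`.
    """
    height, width = len(image), len(image[0])
    assert height % max_size == 0 and width % max_size == 0

    def split(r, c, size):
        # Collect the pixels of the current square block.
        pixels = [image[i][j]
                  for i in range(r, r + size)
                  for j in range(c, c + size)]
        # Keep the block as one patch if it is flat or already minimal.
        if size == base or pvariance(pixels) <= threshold:
            return [(r, c, size)]
        # Otherwise split it into four half-size blocks and recurse.
        half = size // 2
        out = []
        for dr in (0, half):
            for dc in (0, half):
                out.extend(split(r + dr, c + dc, half))
        return out

    patches = []
    for r in range(0, height, max_size):
        for c in range(0, width, max_size):
            patches.extend(split(r, c, max_size))
    return patches


# Toy 32x32 image: flat gray everywhere except a checkerboard in the
# top-left 16x16 corner, which stands in for a "complex" region.
image = [[0.5] * 32 for _ in range(32)]
for i in range(16):
    for j in range(16):
        image[i][j] = float((i + j) % 2)

patches = adaptive_patches(image)
uniform_tokens = (32 // 4) ** 2  # token count with uniform 4x4 patches
print(len(patches), "adaptive tokens vs", uniform_tokens, "uniform tokens")
```

In this toy case the three flat 16x16 quadrants each become a single patch, while the checkerboard quadrant is split down to 4x4 patches, so the sequence has 19 tokens instead of 64. A real model would still need to map patches of different sizes to embeddings of a common dimension, a step this sketch does not cover.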

Published 25 Apr 2026
