Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

📰 ArXiv cs.AI

Cactus speeds up auto-regressive decoding in large language models by using constrained acceptance speculative sampling.

Advanced · Published 8 Apr 2026

Action Steps

Implement speculative sampling with a smaller draft model that proposes tokens for the large model to verify
Apply a constrained acceptance criterion that tolerates slight deviations from the large model's output distribution
Treat variants of that distribution, such as top-$k$ or temperature sampling, as acceptable targets to raise acceptance rates
Evaluate and tune the Cactus approach on your own models and use cases
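
As a reference point for the steps above, here is a minimal sketch of the verification step in standard (exact) speculative sampling, which is the baseline a constrained acceptance rule would relax. It is not the Cactus acceptance rule itself, which this summary does not specify. The function names `speculative_accept` and `expected_acceptance_rate`, and the toy distributions in the usage note, are assumptions made for illustration.

```python
import numpy as np


def speculative_accept(p, q, draft_token, rng):
    """One verification step of standard speculative sampling.

    p: verifier (large model) probabilities over the vocabulary.
    q: draft (small model) probabilities over the vocabulary.
    draft_token: index sampled from q by the draft model.
    Returns (token, accepted).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    # Accept the draft token with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return int(draft_token), True

    # On rejection, resample from the normalized residual max(p - q, 0).
    # This correction keeps the output distributed exactly according to p.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False


def expected_acceptance_rate(p, q):
    """Probability that a token drawn from q is accepted: sum_x min(p(x), q(x))."""
    return float(np.minimum(np.asarray(p, dtype=float), np.asarray(q, dtype=float)).sum())
```

For example, with the toy distributions `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.5, 0.3]`, `expected_acceptance_rate(p, q)` is 0.7, so roughly 30% of draft tokens are rejected under exact matching. The closer the draft distribution is to the target, the higher this rate, which is why loosening the exact-match requirement within a controlled limit can accept more draft tokens per verification pass.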

Who Needs to Know This

ML researchers and engineers who work with large language models can use Cactus to improve decoding efficiency.

Key Insight

💡 Constrained acceptance speculative sampling can improve decoding efficiency without sacrificing accuracy, because the acceptance rule permits only slight deviations from the large model's distribution