Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
📰 ArXiv cs.AI
arXiv:2604.11465v1 Announce Type: new Abstract: Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24 GB GPU, we evaluate Qwen3-8B under both full-precision (FP16, 12K context) and 4-bit quantized (AWQ, 32K context) […]
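The FP16-versus-AWQ trade-off in the abstract can be made concrete with a back-of-envelope weight-memory estimate. This is a sketch with assumed round numbers (8B parameters, weights only); it ignores activations, KV cache, and runtime overhead, and the figures are not taken from the paper.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in GB (decimal)."""
    # params * bits-per-param / 8 bits-per-byte, expressed in GB
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

# An 8B model in FP16 uses roughly 16 GB for weights alone,
# leaving little headroom on a 24 GB GPU for a long context.
fp16_gb = weight_memory_gb(8, 16)   # ~16.0 GB

# The same model at 4-bit (e.g. AWQ) needs roughly 4 GB of weights,
# freeing memory for a larger KV cache and hence a longer context window.
awq_gb = weight_memory_gb(8, 4)     # ~4.0 GB

print(fp16_gb, awq_gb)
```

This arithmetic illustrates why the quantized configuration in the abstract can afford a 32K context while the FP16 configuration is limited to 12K on the same card.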