LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

📰 ArXiv cs.AI

arXiv:2604.14140v1 Announce Type: cross Abstract: As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT […]

Published 16 Apr 2026