DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

📰 ArXiv cs.AI

arXiv:2601.11895v3 Abstract: DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry and synthesized using generator models from multiple provider families to mitigate single-source bias. Unlike prior benchmarks, it emphasizes ecological validity, avoids training da

Published 19 May 2026
