InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
📰 ArXiv cs.AI
arXiv:2604.13201v1 Announce Type: cross

Abstract: Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. Fro