CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

📰 ArXiv cs.AI

arXiv:2604.10825v1. Abstract: We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified

Published 14 Apr 2026