CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
arXiv cs.AI
arXiv:2604.10825v1 Announce Type: new
Abstract: We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified