Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

📰 ArXiv cs.AI

arXiv:2512.19691v3 (announce type: replace)

Abstract: Reference labels for machine-learning benchmarks are increasingly synthesized with LLM assistance, but their reliability remains underexamined. We audit MedCalc-Bench, a clinical benchmark for medical score computation whose labels were partly derived with LLM assistance, and develop a scalable physician-in-the-loop stewardship pipeline to reassess them. At least 27% of test labels are likely erroneous or incomputable. On a 50-instance subset […]

Published 14 Apr 2026