CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation

📰 ArXiv cs.AI

CAF-Score is a reference-free metric for audio captioning evaluation that combines Contrastive Language-Audio Pretraining (CLAP) with Large Audio-Language Models (LALMs).

Advanced | Published 23 Mar 2026

Action Steps

Identify the limitations of reference-based metrics for evaluating audio captioning
Understand how CLAP-based approaches can overlook syntactic errors and fine-grained details
Implement CAF-Score, which uses LALMs to calibrate CLAP's coarse-grained semantic alignment with fine-grained detail
Evaluate how well CAF-Score performs as a reference-free metric for audio captioning
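
The implementation and evaluation steps can be sketched roughly as follows. This summary does not give the actual CAF-Score formula, so the weighted blend below is only an illustrative assumption, not the paper's method. The names blended_score, pairwise_agreement, alpha, and lalm_score are hypothetical, and the embeddings and LALM score are assumed to come from models you run separately.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def blended_score(
    audio_emb: Sequence[float],
    text_emb: Sequence[float],
    lalm_score: float,
    alpha: float = 0.5,
) -> float:
    """Hypothetical blend of a CLAP-style similarity and an LALM judgment.

    ASSUMPTION: a simple convex combination. The real CAF-Score
    calibration may differ; this only illustrates the idea of mixing a
    coarse embedding similarity with a fine-grained score in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if not 0.0 <= lalm_score <= 1.0:
        raise ValueError("lalm_score must be in [0, 1]")
    # Clamp negative similarities so both terms live in [0, 1].
    clap_sim = max(0.0, cosine_similarity(audio_emb, text_emb))
    return alpha * clap_sim + (1.0 - alpha) * lalm_score


def pairwise_agreement(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    human_prefers_a: Sequence[bool],
) -> float:
    """Fraction of caption pairs where the metric agrees with human preference.

    For each pair, the metric "prefers A" only if A scores strictly
    higher than B; ties therefore count as preferring B.
    """
    if not (len(scores_a) == len(scores_b) == len(human_prefers_a)):
        raise ValueError("all inputs must have the same length")
    if not scores_a:
        raise ValueError("at least one pair is required")
    hits = sum(
        (sa > sb) == pref
        for sa, sb, pref in zip(scores_a, scores_b, human_prefers_a)
    )
    return hits / len(scores_a)
```

With alpha near 1 the blend behaves like a plain embedding similarity, while lower values give the LALM judgment more say over fine-grained detail. Agreement with human preference over caption pairs is one common way to compare reference-free metrics, though the benchmarks used in the paper are not stated in this summary.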

Who Needs to Know This

AI engineers and researchers working on audio captioning can use CAF-Score as a more robust evaluation metric. Product managers can use it to improve the overall quality of audio captioning systems.

Key Insight

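💡 CLAP captures coarse-grained semantic alignment between audio and caption but can overlook syntactic errors and fine-grained details. CAF-Score uses LALMs to calibrate for these, which makes it a more robust reference-free metric for audio captioning.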