Abstract: As interest grows in generating long, detailed image captions, existing automatic evaluation metrics are increasingly strained. N-gram-based metrics, though efficient, fail to capture semantic correctness, especially for longer outputs. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to their high computational cost; today, despite advances in hardware, they remain unpopular because they fall short of even weak baselines such as BLEU. Meanwhile, metrics based on large language models (LLMs) correlate strongly with human judgments but remain too expensive for routine use during model development. We introduce SPECS (Specificity-Enhanced CLIP-Score), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing errors. We show that SPECS matches the performance of leading LLM-based metrics in correlating with human judgments while being far more efficient, making it a practical alternative for frequent, low-cost evaluation during image captioning model development.
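For illustration only, below is a minimal sketch of the reference-free, CLIP-Score-style computation that SPECS builds on: an image-caption cosine similarity from a stock CLIP checkpoint (assumed here to be openai/clip-vit-base-patch32 via HuggingFace Transformers). The SPECS-specific specificity objective and any fine-tuning described in the paper are not reproduced; the function name clip_score and the example file path are hypothetical.

```python
# Sketch of a reference-free CLIP-Score-style metric (not the trained SPECS model).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between image and caption embeddings, floored at 0."""
    # Note: stock CLIP truncates text at 77 tokens, which is one reason plain
    # CLIP-Score struggles with long, detailed captions.
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    sim = (img_emb * txt_emb).sum(dim=-1).item()
    return max(sim, 0.0)  # standard CLIPScore reports max(cos, 0), often scaled by 2.5

# Example usage (hypothetical path):
# score = clip_score(Image.open("example.jpg"), "A long, detailed caption ...")
```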
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Vision Language Model, Dense image caption, Caption evaluation metric
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: VLM, Dense image caption
Submission Number: 123