Abstract: As interest grows in generating long, detailed image captions, existing automatic evaluation metrics are increasingly strained. N-gram-based metrics, though efficient, fail to capture semantic correctness, especially for longer outputs. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to their high computational cost; today, despite advances in hardware, they remain unpopular because they fall short of even weak baselines such as BLEU. Meanwhile, metrics based on large language models (LLMs) correlate strongly with human judgments but remain too expensive for routine use during model development. We introduce SPECS (Specificity-Enhanced CLIP-Score), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing errors. We show that SPECS matches the performance of leading LLM-based metrics in correlating with human judgments while being far more efficient, making it a practical alternative for frequent, low-cost evaluation during image captioning model development.
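For illustration only, below is a minimal sketch of the reference-free, CLIP-Score-style computation that SPECS builds on: an image-caption cosine similarity from a stock CLIP checkpoint (assumed here to be openai/clip-vit-base-patch32 via HuggingFace Transformers). The SPECS-specific specificity objective and any fine-tuning described in the paper are not reproduced; the function name clip_score and the example file path are hypothetical.

```python
# Sketch of a reference-free CLIP-Score-style metric (not the trained SPECS model).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between image and caption embeddings, floored at 0."""
    # Note: stock CLIP truncates text at 77 tokens, which is one reason plain
    # CLIP-Score struggles with long, detailed captions.
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    sim = (img_emb * txt_emb).sum(dim=-1).item()
    return max(sim, 0.0)  # standard CLIPScore reports max(cos, 0), often scaled by 2.5

# Example usage (hypothetical path):
# score = clip_score(Image.open("example.jpg"), "A long, detailed caption ...")
```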
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Vision Language Model, Dense image caption, Caption evaluation metric
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: VLM, Dense image caption
Submission Number: 123