Keywords: vision-language, robustness, caption, benchmark, mllm
TL;DR: We introduce a benchmark and method for detecting a challenging class of captioning errors termed systematic misalignments.
Abstract: Multimodal large language models (MLLMs) often introduce errors when generating image captions, resulting in misaligned image-text pairs. Our work focuses on a class of captioning errors that we refer to as systematic misalignments, where a recurring error in MLLM-generated captions is closely associated with the presence of a specific visual feature in the paired image. Given a vision-language dataset with MLLM-generated captions, our aim is to detect such errors, a task we refer to as systematic misalignment detection. As our first key contribution, we introduce SymbalBench, the first benchmark designed to evaluate automated methods for identifying systematic misalignments. SymbalBench consists of 420 vision-language datasets from two domains (natural images and medical images) with annotated systematic misalignments. As our second key contribution, we present Symbal, which uses a structured, dual-stage setup with off-the-shelf foundation models to identify such errors and summarize results in natural language. Symbal exhibits strong performance on SymbalBench, correctly identifying systematic misalignments in 63.8% of datasets, nearly a 4x improvement over the closest baseline. We supplement SymbalBench with real-world evaluations, showing that Symbal can identify systematic misalignments in captions generated by an off-the-shelf MLLM. Ultimately, our novel task, benchmark, and method can aid users in auditing MLLM-generated captions and identifying critical failure modes, without requiring access to the underlying MLLM.
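For concreteness, below is a minimal sketch of how a dual-stage detection loop of this kind might be wired together. It assumes a hypothetical `query_model` helper standing in for the off-the-shelf foundation-model calls; the dataclass, prompts, and acceptance rule are illustrative and are not the authors' implementation.

```python
# Illustrative sketch: propose candidate systematic misalignments from a sample
# of (image, caption) pairs, then verify and summarize them in natural language.
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str
    caption: str

def query_model(prompt: str) -> str:
    """Hypothetical stand-in: replace with a call to your preferred foundation model."""
    return "captions mention 'daytime' whenever a shadow is visible"

def propose_candidates(examples: list[Example], sample_size: int = 50) -> list[str]:
    """Stage 1: ask the model to hypothesize recurring caption errors tied to a visual feature."""
    sample = examples[:sample_size]
    listing = "\n".join(f"- {e.image_path}: {e.caption}" for e in sample)
    prompt = (
        "Below are image paths and their generated captions. "
        "List recurring caption errors that co-occur with a specific visual feature.\n" + listing
    )
    return [line.strip() for line in query_model(prompt).splitlines() if line.strip()]

def verify_and_summarize(examples: list[Example], candidates: list[str]) -> str:
    """Stage 2: check each candidate against the full dataset and summarize confirmed ones."""
    confirmed = []
    for cand in candidates:
        verdict = query_model(
            f"Candidate systematic misalignment: {cand}\n"
            f"Does it hold across these {len(examples)} examples? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            confirmed.append(cand)
    return "Confirmed systematic misalignments:\n" + "\n".join(confirmed or ["(none)"])

if __name__ == "__main__":
    data = [Example("img_001.jpg", "A dog playing in the park during the daytime.")]
    print(verify_and_summarize(data, propose_candidates(data)))
```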
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14974