Seeing What’s Wrong: A Trajectory-Guided Approach to Caption Error Detection

ICLR 2026 Conference Submission9807 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: Image-Caption Alignment, Error Detection, Caption Trajectory
TL;DR: We introduce TRACED, a model-agnostic, efficient, and interpretable framework that leverages caption trajectories to improve error detection on image–caption datasets
Abstract: Error detection is critical for enhancing multimodal dataset reliability and downstream model performance. Existing error filters, while increasingly powerful, typically rely on a single similarity score per image–caption pair. This is limiting: captions with subtle errors (e.g., mislabeled objects, incorrect colors, or negations) can still score highly, while correct but imprecisely worded captions may score poorly. To address this, we introduce the notion of a caption trajectory: an ordered sequence of captions produced by iteratively editing a caption to maximize an image-text relevance score. This trajectory carries rich signals for error detection: correct captions typically stabilize after minor edits, while erroneous captions undergo substantial improvements. Building on these insights, we introduce TRACED, a cost-efficient and model-agnostic framework that leverages trajectory statistics for more accurate caption error detection. Beyond detection, TRACED also serves as an interpretable tool for identifying the origins of errors. We further demonstrate that, for error correction, this interpretable token-level error information can be provided to VLMs to enhance the alignment scores of the generated captions. On MS COCO and Flickr30k, TRACED achieves up to a 2.8% improvement in error-detection accuracy across three noise types.
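The abstract describes a trajectory as the sequence of captions produced by iterative, score-maximizing edits, with summary statistics over that sequence used as error-detection features. A minimal sketch of that loop is below; `relevance_score` and `propose_edit` are hypothetical placeholders for an image-text scorer and a caption editor, and the statistics shown are illustrative, not TRACED's actual feature set.

```python
# Sketch of caption-trajectory construction, assuming hypothetical
# `relevance_score(image, caption) -> float` and
# `propose_edit(image, caption) -> str` components.
import numpy as np

def build_trajectory(image, caption, relevance_score, propose_edit, max_steps=5):
    """Iteratively edit `caption`, keeping only edits that raise the score."""
    trajectory = [caption]
    scores = [relevance_score(image, caption)]
    for _ in range(max_steps):
        candidate = propose_edit(image, trajectory[-1])
        s = relevance_score(image, candidate)
        if s <= scores[-1]:  # no further gain: the caption has stabilized
            break
        trajectory.append(candidate)
        scores.append(s)
    return trajectory, scores

def trajectory_stats(scores):
    """Illustrative summary statistics over the score trajectory."""
    scores = np.asarray(scores, dtype=float)
    return {
        # large total gain suggests the original caption was erroneous
        "total_gain": float(scores[-1] - scores[0]),
        # correct captions tend to stabilize after few edits
        "num_edits": len(scores) - 1,
        "max_step_gain": float(np.max(np.diff(scores))) if len(scores) > 1 else 0.0,
    }
```

A downstream detector would then threshold or classify over these statistics rather than over the single initial similarity score.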
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9807