ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: A new method for measuring the concreteness of captions that improves multimodal dataset curation.
Abstract: Web-scale training on paired text-image data is becoming increasingly central in multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs (e.g., via CLIP similarity), but permit semantically related yet highly abstract text. In this work, we propose a new metric, ICC, that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. To calculate this metric, we leverage strong foundation models to measure visual-semantic information loss in multimodal representations. We show that curation using ICC complements existing approaches and succeeds in distilling multimodal web-scale datasets for more efficient and effective learning. Moreover, we show that ICC strongly correlates with human evaluation of caption concreteness.
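The abstract describes combining standard image-text similarity filtering (e.g., CLIP) with a text-only concreteness score. The paper's actual ICC implementation is not shown here; the sketch below is purely illustrative, using precomputed embedding vectors and a hypothetical per-caption `icc_score` to show how the two filters could compose. The function names and thresholds are assumptions, not the authors' code.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_pairs(pairs, sim_threshold=0.3, icc_threshold=0.5):
    """Keep image-text pairs that pass both a CLIP-style image/caption
    similarity check and a text-only concreteness score, illustrating
    how an ICC-like metric could complement similarity filtering.

    `pairs` is a list of (image_embedding, caption_embedding, icc_score)
    tuples; the embeddings and scores here are hypothetical inputs."""
    kept = []
    for img_emb, txt_emb, icc_score in pairs:
        if cosine(img_emb, txt_emb) >= sim_threshold and icc_score >= icc_threshold:
            kept.append((img_emb, txt_emb, icc_score))
    return kept
```

A mismatched pair fails the similarity check even with a concrete caption, while an abstract caption (low concreteness score) is dropped even when its embedding matches the image — the failure mode the abstract says CLIP filtering alone permits.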
Paper Type: short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English