Benchmarking Multimodal Idiomaticity: Tasks and Methods for Idiomatic Language Understanding in Text and Images

ACL ARR 2024 December Submission 2220 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: In this paper, we present a dataset of images and texts representing potentially idiomatic expressions in two languages, English and Portuguese. The expressions were selected for their potential ambiguity between a literal and an idiomatic sense, and they are represented as static images or as image sequences to capture more abstract or temporally dependent cases. To investigate how well models handle idiomatic expressions and integrate cues from different modalities (textual and visual/visual-temporal data), we propose two tasks that examine how monomodal and multimodal representations perform: a multiple-choice image selection task and a next-image prediction task. Using a metric we propose for graded relevance, based on Normalized Discounted Cumulative Gain, the results obtained by representative models indicate that multimodal generative models, within our framework, outperform traditional vision-and-language models in comprehending idiomatic expressions by effectively integrating visual and textual information.
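For context, the sketch below shows the standard formulation of Normalized Discounted Cumulative Gain that graded-relevance evaluation is typically built on; the paper's proposed variant may differ in details (relevance scale, truncation depth), and the example relevance labels are purely hypothetical.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """Normalized DCG: DCG of the model's ranking divided by the ideal (sorted) DCG."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical example: a model ranks four candidate images for an expression;
# annotators assign graded relevance (e.g., 3 = best match, 0 = irrelevant).
print(round(ndcg([3, 0, 2, 1]), 2))  # ~0.93
```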
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; benchmarking; multilingual corpora; automatic evaluation of datasets
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Portuguese
Submission Number: 2220