Keywords: Multimodal Large Language Models, Benchmark, Vision and Language
Abstract: Vision and Language are two major modalities in Artificial Intelligence research.
Bridging the gap between these modalities has long been a key focus in the multimodal community.
Inspired by human cognition, we believe that if a model can see an image and directly associate it with its linguistic meaning, then it possesses high-level intelligence spanning vision and language.
In our work, we focus on emojis in images, a widely used "cryptic symbol" whose data form carries both visual and linguistic features: emojis have specific textual semantics, while humans understand their meaning from visual information.
Specifically, we first propose the novel task of translating emojis in images into corresponding idioms, thereby challenging Multimodal Large Language Models (MLLMs) to (1) understand the semantic correlation between language and emojis, and (2) reason about the intricate linguistic meaning conveyed by emojis in images.
To facilitate the advancement of this task, we construct a high-quality benchmark (emoji2idiom) through a process of automatic model generation followed by manual human filtering.
Based on our constructed emoji2idiom, we employ multiple advanced MLLMs to conduct extensive experiments and detailed analyses, demonstrating that existing MLLMs do not yet have sufficient capability to understand and reason about linguistic information from visual data.
We believe our proposed benchmark and interesting discoveries will encourage the community to attach importance to the ability of MLLMs to directly associate language with vision, endowing MLLMs with more comprehensive vision-language understanding.
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6776