Keywords: Multimodal Idiom Understanding, Potentially Idiomatic Expressions (PIEs), Zero-Shot Learning, Large Vision-Language Models (LVLMs), Chain-of-Thought Prompting
Working Group: WG3: Multilingual and cross-lingual language technology
Abstract: This paper presents our system for AdMIRe 2 (Advancing Multimodal Idiomaticity Representation), a shared task on multilingual multimodal idiom understanding. The task focuses on ranking images according to how well they depict the literal or idiomatic usage of potentially idiomatic expressions (PIEs) in context, across 15 languages and two tracks: a text-only track, and a multimodal track that uses both images and captions. To tackle both tracks, we propose a hybrid zero-shot pipeline built on large vision–language models (LVLMs). Our system employs a chain-of-thought prompting scheme that first classifies each PIE usage as literal or idiomatic and then ranks candidate images by their alignment with the inferred meaning. A primary–fallback routing mechanism increases robustness to safety-filter refusals, while lightweight post-processing recovers consistent rankings from imperfect model outputs. Without any task-specific fine-tuning, our approach achieves 55.9% Top-1 Accuracy in the text-only track and 60.1% in the multimodal (text+image) track, ranking first overall on the official leaderboard. These results suggest that carefully designed zero-shot LVLM pipelines can provide strong baselines for multilingual multimodal idiomaticity benchmarks.
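The routing and post-processing steps named in the abstract could look roughly like the sketch below. It is an illustration only, not the authors' implementation. The function names (`repair_ranking`, `rank_with_fallback`), the convention that a refusal is signaled by `None`, and the assumption that the model reply lists 1-based image indices are all hypothetical.

```python
import re
from typing import Callable, List, Optional


def repair_ranking(raw_output: str, n_candidates: int) -> List[int]:
    """Turn a noisy model reply into a full ranking of 1..n_candidates.

    Assumption: the reply mentions image indices as integers, best first.
    Duplicates and out-of-range numbers are dropped; indices the model
    never mentioned are appended in ascending order.
    """
    seen: List[int] = []
    for token in re.findall(r"\d+", raw_output):
        idx = int(token)
        if 1 <= idx <= n_candidates and idx not in seen:
            seen.append(idx)
    missing = [i for i in range(1, n_candidates + 1) if i not in seen]
    return seen + missing


def rank_with_fallback(
    prompt: str,
    primary: Callable[[str], Optional[str]],
    fallback: Callable[[str], Optional[str]],
    n_candidates: int,
) -> List[int]:
    """Query the primary model; use the fallback if it refuses.

    Assumption: each callable returns None when the model refuses
    (for example, after a safety-filter block).
    """
    reply = primary(prompt)
    if reply is None:
        reply = fallback(prompt)
    # If both models refuse, the repair step yields the default order.
    return repair_ranking(reply or "", n_candidates)
```

In this sketch, a reply such as `"3, 1, 3, 7"` for five candidates keeps the valid, first-seen indices and fills in the rest, so the output is always a complete permutation that can be scored.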
WG3 Tasks: Task 3.5 Evaluation campaign: AdMIRe - Advancing Multimodal Idiomaticity Representation
Tracks For Type Of Contribution: Complete work (including previously published work)
Do You Need Visa To Attend The 4th UniDive General Meeting In Romania: Yes
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 60