Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding

Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno

Published: 2025, Last Modified: 04 Mar 2026, CoRR 2025, CC BY-SA 4.0

Abstract: We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning and is robust to textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that using the visual modality as a meaning representation is a promising direction for robust natural language understanding.
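The cosine-similarity technique mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the image generated from the premise and the textual hypothesis have already been embedded into a shared vector space by some multimodal encoder; the abstract does not specify the encoder or the decision rule, so the function names, the toy vectors, and the threshold value below are all hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def predict_label(
    premise_image_embedding: Sequence[float],
    hypothesis_text_embedding: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> str:
    """Label a pair as entailed when the embeddings are similar enough.

    The thresholding rule is an assumption made for illustration only.
    """
    score = cosine_similarity(premise_image_embedding, hypothesis_text_embedding)
    return "entailment" if score >= threshold else "not_entailment"


if __name__ == "__main__":
    # Toy vectors standing in for real encoder outputs.
    image_vec = [0.9, 0.1, 0.0]
    close_hypothesis = [0.8, 0.2, 0.0]
    distant_hypothesis = [0.0, 0.1, 0.9]
    print(predict_label(image_vec, close_hypothesis))
    print(predict_label(image_vec, distant_hypothesis))
```

In practice the embeddings would come from a pretrained image-text model, and any threshold would need to be chosen without task-specific training to stay within the zero-shot setting the abstract describes.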

External IDs: dblp:journals/corr/abs-2511-17358