Don’t Learn, Ground: Image Generation for Grounded NLI

ACL ARR 2026 January Submission 4626 Authors

05 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: natural language inference, synthetic data, textual entailment
Abstract: We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these with textual hypotheses. The pipeline achieves accuracy comparable to that of text-based NLI classifiers while offering additional transparency. Our findings suggest that grounding language in vision is a viable and effective strategy for advancing robust natural language understanding.
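The two-stage pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_image` and `image_text_score` are hypothetical stand-ins for a text-to-image model (e.g. a diffusion model) and an image–text alignment model (e.g. a CLIP-style scorer), and the decision thresholds are illustrative, not taken from the paper.

```python
def classify_nli(premise, hypothesis, generate_image, image_text_score,
                 entail_threshold=0.30, contra_threshold=0.15):
    """Zero-shot grounded NLI sketch.

    1. Ground the premise visually by rendering it with a
       text-to-image model (hypothetical `generate_image`).
    2. Score the textual hypothesis against the generated image
       with an image-text alignment model (hypothetical
       `image_text_score`).
    3. Map the alignment score to an NLI label via illustrative
       thresholds (not the paper's actual decision rule).
    """
    image = generate_image(premise)              # text-to-image step
    score = image_text_score(image, hypothesis)  # image-text comparison
    if score >= entail_threshold:
        return "entailment"
    if score <= contra_threshold:
        return "contradiction"
    return "neutral"


if __name__ == "__main__":
    # Stub models so the sketch runs without any downloads:
    # the "image" is just a tagged string, and the scorer fakes
    # alignment by keyword overlap.
    def fake_generate(premise):
        return "IMG:" + premise

    def fake_score(image, hypothesis):
        return 0.5 if "dog" in hypothesis else 0.05

    print(classify_nli("a dog runs in a park", "a dog is outside",
                       fake_generate, fake_score))
```

A transparency benefit follows directly from this structure: the generated image is an inspectable intermediate artifact, so a human can see what the system "understood" the premise to mean before the entailment decision is made.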
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, vision question answering, multimodality
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4626