Don’t Learn, Ground: Image Generation for Grounded NLI

ACL ARR 2026 January Submission 4626 Authors

05 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: natural language inference, synthetic data, textual entailment
Abstract: We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these with textual hypotheses. The pipeline achieves accuracy comparable to that of text-based NLI classifiers while offering additional transparency. Our findings suggest that grounding language in vision is a viable and effective strategy for advancing robust natural language understanding.
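The two-stage pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_image` and `image_text_score` are hypothetical stand-ins for a text-to-image model (e.g. a diffusion model) and an image–text alignment model (e.g. a CLIP-style scorer), and the decision thresholds are illustrative, not taken from the paper.

```python
def classify_nli(premise, hypothesis, generate_image, image_text_score,
                 entail_threshold=0.30, contra_threshold=0.15):
    """Zero-shot grounded NLI sketch.

    1. Ground the premise visually by rendering it with a
       text-to-image model (hypothetical `generate_image`).
    2. Score the textual hypothesis against the generated image
       with an image-text alignment model (hypothetical
       `image_text_score`).
    3. Map the alignment score to an NLI label via illustrative
       thresholds (not the paper's actual decision rule).
    """
    image = generate_image(premise)              # text-to-image step
    score = image_text_score(image, hypothesis)  # image-text comparison
    if score >= entail_threshold:
        return "entailment"
    if score <= contra_threshold:
        return "contradiction"
    return "neutral"


if __name__ == "__main__":
    # Stub models so the sketch runs without any downloads:
    # the "image" is just a tagged string, and the scorer fakes
    # alignment by keyword overlap.
    def fake_generate(premise):
        return "IMG:" + premise

    def fake_score(image, hypothesis):
        return 0.5 if "dog" in hypothesis else 0.05

    print(classify_nli("a dog runs in a park", "a dog is outside",
                       fake_generate, fake_score))
```

A transparency benefit follows directly from this structure: the generated image is an inspectable intermediate artifact, so a human can see what the system "understood" the premise to mean before the entailment decision is made.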
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, vision question answering, multimodality
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4626