Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding

ACL ARR 2024 June Submission5489 Authors

16 Jun 2024 (modified: 13 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Contrastively trained vision-language models such as CLIP have achieved remarkable progress in vision and language representation learning. Despite this progress, their proficiency in compositional reasoning over attributes and relations (e.g., distinguishing between "the car is underneath the person" and "the person is underneath the car") remains notably inadequate. We identify the cause of this deficient behavior as a composition attribution issue: the attribution scores (e.g., attention scores or GradCAM scores) for relation words (e.g., underneath) or attribute words (e.g., red) in the text are substantially lower than those for object terms. In this work, we show that this issue can be mitigated via a novel framework called CAE (Composition Attribution Enhancement). This generic framework incorporates various interpretable attribution methods to encourage the model to pay greater attention to composition words denoting relationships and attributes within the text. Detailed analysis shows that our approach enables models to adjust and rectify their attribution over the text. Extensive experiments across seven benchmarks reveal that our framework significantly enhances the ability to discern intricate details and construct more sophisticated interpretations of combined visual and linguistic elements.
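For illustration only, below is a minimal sketch of what such a composition-attribution objective could look like, assuming a gradient×activation attribution over text-token states and an auxiliary loss that raises the attribution share of relation/attribute tokens. All names (token_attributions, composition_attribution_loss, comp_mask, the stand-in encoders) are hypothetical and not taken from the paper's released implementation.

```python
# Hypothetical sketch of a composition-attribution objective (not the authors' code).
import torch
import torch.nn.functional as F


def token_attributions(text_embeds, image_embed, token_states):
    """Gradient x activation score of the image-text similarity w.r.t. each text token."""
    sim = F.cosine_similarity(text_embeds, image_embed, dim=-1).sum()
    grads = torch.autograd.grad(sim, token_states, create_graph=True)[0]
    # Collapse the hidden dimension so each token gets a single attribution score.
    return (grads * token_states).sum(dim=-1).abs()


def composition_attribution_loss(attrib, comp_mask, pad_mask):
    """Penalize a low share of attribution on relation/attribute (composition) tokens."""
    attrib = attrib * pad_mask                                # drop padding positions
    comp_share = (attrib * comp_mask).sum(-1) / (attrib.sum(-1) + 1e-8)
    return (1.0 - comp_share).mean()                          # lower when composition tokens get more credit


if __name__ == "__main__":
    B, L, D = 2, 6, 32
    token_states = torch.randn(B, L, D, requires_grad=True)   # stand-in per-token hidden states
    text_head = torch.nn.Linear(D, D)                         # stand-in text pooling/projection
    text_embeds = text_head(token_states.mean(dim=1))
    image_embed = torch.randn(B, D)                           # stand-in image features

    attrib = token_attributions(text_embeds, image_embed, token_states)
    comp_mask = torch.zeros(B, L)
    comp_mask[:, 2] = 1.0                                     # pretend token 2 is the relation word
    pad_mask = torch.ones(B, L)

    loss = composition_attribution_loss(attrib, comp_mask, pad_mask)
    loss.backward()                                           # differentiable thanks to create_graph=True
    print(f"composition attribution loss: {loss.item():.4f}")
```

In practice, the attribution signal could equally come from attention scores or GradCAM over a CLIP-style encoder, as the abstract notes; the loss above is only one plausible way to encourage higher attribution on composition words.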
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: compositional understanding, vision-language models, attribution tracing
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5489