Enhancing Semantic Understanding in Vision Language Models Using Meaning Representation Negative Generation
Keywords: Vision Language Models, Semantic Understanding, Compositional Understanding, Abstract Meaning Representation
Abstract: Vision language models have been criticized for behaving like bag-of-words models that lack semantic understanding. Efforts to address this concern have included integrating composition-aware negative samples into contrastive learning methodologies. However, current negative generation methods exhibit limited semantic comprehension, diversity, and fluency. To tackle this issue, we propose leveraging Abstract Meaning Representation (AMR), a representation of considerable interest in natural language processing research, for negative sample generation. By altering the structure of the meaning representation, we create negative samples whose meanings differ entirely from the original caption while their surface forms remain close paraphrases. These AMR-generated negatives are then incorporated alongside token-swap negatives during contrastive training. Our results indicate that AMR-generated negatives introduce significantly more diverse patterns than token-swap negatives alone. Furthermore, including AMR-generated negative samples enhances the models' performance across a range of compositional understanding tasks.
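To make the core idea concrete, below is a minimal sketch of one plausible AMR perturbation: swapping the :ARG0 (agent) and :ARG1 (patient) roles in a caption's AMR graph so the meaning inverts while the wording stays a near paraphrase. It uses the `penman` library for AMR (de)serialization; the choice of this particular perturbation and the helper name are illustrative assumptions, not the paper's exact procedure, and in a full pipeline the perturbed graph would be passed to an AMR-to-text generator to produce the negative caption.

```python
# Hedged sketch of an AMR-based negative: invert who-does-what-to-whom by
# swapping :ARG0 and :ARG1 roles. The resulting graph still describes the same
# entities and event, so the generated text stays a close paraphrase, but the
# meaning is entirely different -- the property the abstract describes.
import penman

def swap_agent_patient(amr_str: str) -> str:
    """Return a perturbed AMR string with :ARG0 and :ARG1 roles exchanged."""
    graph = penman.decode(amr_str)
    swapped = []
    for source, role, target in graph.triples:
        if role == ":ARG0":
            role = ":ARG1"
        elif role == ":ARG1":
            role = ":ARG0"
        swapped.append((source, role, target))
    return penman.encode(penman.Graph(swapped))

# AMR for "the dog chases the cat"; after the swap it encodes
# "the cat chases the dog" -- a fluent, meaning-changing negative.
amr = """
(c / chase-01
   :ARG0 (d / dog)
   :ARG1 (x / cat))
"""
print(swap_agent_patient(amr))
```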
Submission Number: 6