VEglue: Testing Visual Entailment Systems via Object-Aligned Joint Erasing

Zhiyuan Chang, Mingyang Li, Junjie Wang, Cheng Li, Qing Wang

Published: 2026, Last Modified: 08 Apr 2026. ACM Trans. Softw. Eng. Methodol. 2026. CC BY-SA 4.0

Abstract: Visual entailment (VE) is a multimodal reasoning task over image-sentence pairs, in which an image serves as the premise and a sentence describes a hypothesis. The goal is to predict whether the image semantically entails the sentence. VE systems have been widely adopted in downstream tasks such as image captioning and visual question answering. However, the robustness of VE systems still faces significant challenges. One reason is that VE systems suffer from an object-confusing defect when similar objects exist: the system outputs a positive prediction inferred from an erroneous object relationship, which results in a faulty negative prediction if the noise object is absent. Previous test generation approaches rely primarily on general perturbations, such as simulating noise or weather interference in images, or substituting synonyms or rewriting sentences in texts. Testing the object-confusing defect in VE systems requires perceiving and understanding key objects and entities while maintaining the semantic relevance between cross-modal inputs, which makes it challenging to generate effective, high-quality tests. Therefore, we propose VEglue, an object-aligned joint erasing approach for testing VE systems. It first aligns the object regions in the premise with the object descriptions in the hypothesis to identify linked and un-linked objects. Then, based on the alignment information, three metamorphic relations are designed to jointly erase objects in the two modalities. We evaluate VEglue on four widely used VE systems with two public datasets, and the results demonstrate that VEglue detects 11,609 issues on average with a 52.5% Issue Finding Rate (IFR). Furthermore, we leverage the tests generated by VEglue to retrain the VE systems, which largely improves model performance (a 50.8% increase in accuracy) on newly generated tests without sacrificing accuracy on the original test set.