Towards Reporting Bias in Visual-Language Datasets: Bi-Modal Data Augmentation by Decoupling Object-Attribute Association

Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi Wakaki, Yuki Mitsufuji

Published: 2025, Last Modified: 17 Apr 2026. Venue: ICCVW 2025. License: CC BY-SA 4.0

Abstract: Reporting bias occurs when people assume universal understanding and omit explicit details. In this paper, we focus on the widespread object-attribute association in vision-language datasets, which is caused by reporting bias and can consequently degrade models trained on them. To mitigate this, we propose a bi-modal augmentation (BiAug) approach that decouples object-attribute associations to flexibly synthesize vision-language examples with a rich array of object-attribute pairings, and constructs cross-modal hard negative vision-language examples. First, BiAug decouples object-attribute associations: cross-modal verified object candidates are extracted, followed by generation of contradictory attributes for the candidates. Second, BiAug synthesizes hard negative vision-language examples: objects with generated attributes are integrated into both the image and the caption through an image inpainting model and a large language model, respectively. With these two steps, the synthesized examples explicitly complement the omitted objects and attributes of the original examples, and the hard negative examples steer the model to distinguish various attributes for an identical object. Extensive experiments and analysis demonstrate that the model trained with our augmented dataset excels in object-attribute comprehension.
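The two steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the attribute lexicon, the `Example` type, and the function names are hypothetical, the image inpainting model is replaced by a string tag on the image identifier, and the large language model is replaced by plain string substitution. It only shows the data flow, assuming object lists have already been extracted from the caption and detected in the image.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: object -> mutually contradictory attributes.
# In the paper, attribute generation is done by a model, not a lookup table.
CANDIDATE_ATTRIBUTES = {
    "banana": ["yellow", "green"],
    "table": ["wooden", "metal"],
}


@dataclass(frozen=True)
class Example:
    image_id: str  # stands in for the actual pixel data
    caption: str


def verified_objects(caption_objects, detected_objects):
    """Step 1a: keep only objects present in both the caption and the image."""
    detected = set(detected_objects)
    return [obj for obj in caption_objects if obj in detected]


def synthesize(example, obj):
    """Steps 1b and 2: pair `obj` with each candidate attribute in both modalities."""
    results = []
    for attr in CANDIDATE_ATTRIBUTES.get(obj, []):
        # Placeholder for image inpainting: only record the intended edit.
        new_image = f"{example.image_id}|inpaint({attr} {obj})"
        # Placeholder for LLM caption rewriting: make the attribute explicit.
        new_caption = example.caption.replace(obj, f"{attr} {obj}", 1)
        results.append(Example(new_image, new_caption))
    return results


original = Example("img_001", "a banana on a table")
# "table" is mentioned in the caption but not detected, so it is filtered out.
objects = verified_objects(["banana", "table"], ["banana", "plate"])
augmented = [ex for obj in objects for ex in synthesize(original, obj)]

for ex in augmented:
    print(ex.image_id, "->", ex.caption)
```

In this sketch, the synthesized examples for the same object differ only in the attribute, so each one acts as a hard negative for the others: an image edited toward a green banana should not match the caption that mentions a yellow banana.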