Mix background and foreground separately: Transformer-based Augmentation Strategies for Domain Generalization

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · ICME 2024 · CC BY-SA 4.0
Abstract: Domain generalization (DG) aims to alleviate the severe performance degradation that deep neural networks suffer when a domain shift exists between training and testing data. Differing background distributions across samples are a vital factor contributing to this domain gap. Inspired by augmentation-based methods, we aim to mix up the backgrounds of samples to generate new training images. The classical Mixup method randomly blends two samples at the image level; because it mixes entire images, it cannot disentangle backgrounds from semantic labels. To solve this problem, we introduce a method that separates the foreground and background of samples at the patch level using a Vision Transformer (ViT). Concretely, we compute an attention score for each patch from the self-attention modules in the ViT, and then identify foreground and background patches by ranking these scores. We then present a Background-Mix method that blends the backgrounds of samples from different domains, regularizing the model to ignore the spurious correlation between background and semantic label. Moreover, the varying appearance of objects across source domains also contributes to the performance drop on the target domain. We therefore design a Foreground-Mix method that mixes only the object regions, excluding the background, enabling the model to classify objects rendered in diverse appearance patterns. We refer to the complete framework as BFMix. Extensive experiments demonstrate that our method achieves state-of-the-art performance.
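To make the two mixing strategies concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: rank patches by a per-patch attention score, split them into foreground and background, and mix only one part. It assumes pixel-level mixing of non-overlapping patches and uses the [CLS]-to-patch attention of the final ViT block (averaged over heads) as the saliency signal; the patch size (16), foreground ratio, mixing coefficient, and all function names here are illustrative choices, not the paper's exact implementation.

```python
import torch

def patchify(img: torch.Tensor, p: int) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping p x p patches -> (N, C*p*p)."""
    C, H, W = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)        # (C, H//p, W//p, p, p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C * p * p)

def unpatchify(patches: torch.Tensor, C: int, H: int, W: int, p: int) -> torch.Tensor:
    """Inverse of patchify: (N, C*p*p) -> (C, H, W)."""
    gh, gw = H // p, W // p
    grid = patches.reshape(gh, gw, C, p, p).permute(2, 0, 3, 1, 4)
    return grid.reshape(C, H, W)

def fg_bg_indices(cls_attn: torch.Tensor, fg_ratio: float = 0.5):
    """Rank patches by [CLS]-to-patch attention; top fraction = foreground."""
    k = int(fg_ratio * cls_attn.numel())
    order = cls_attn.argsort(descending=True)
    return order[:k], order[k:]          # foreground indices, background indices

def background_mix(img_a, img_b, attn_a, p=16, fg_ratio=0.5, lam=0.5):
    """Keep a's foreground (and label); blend a's background patches with b's
    patches at the same positions, where b comes from a different domain."""
    C, H, W = img_a.shape
    pa, pb = patchify(img_a, p), patchify(img_b, p)
    _, bg = fg_bg_indices(attn_a, fg_ratio)
    pa[bg] = lam * pa[bg] + (1.0 - lam) * pb[bg]
    return unpatchify(pa, C, H, W, p)

def foreground_mix(img_a, img_b, attn_a, p=16, fg_ratio=0.5, lam=0.5):
    """Mix only the object (foreground) patches; a's background is untouched."""
    C, H, W = img_a.shape
    pa, pb = patchify(img_a, p), patchify(img_b, p)
    fg, _ = fg_bg_indices(attn_a, fg_ratio)
    pa[fg] = lam * pa[fg] + (1.0 - lam) * pb[fg]
    return unpatchify(pa, C, H, W, p)

if __name__ == "__main__":
    img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
    n = (224 // 16) ** 2                  # 196 patches, ViT-B/16 geometry
    attn_a = torch.rand(n)                # stand-in for real [CLS] attention
    print(background_mix(img_a, img_b, attn_a).shape)  # torch.Size([3, 224, 224])
    print(foreground_mix(img_a, img_b, attn_a).shape)  # torch.Size([3, 224, 224])
```

In this reading, Background-Mix preserves the label of sample a because only non-causal background patches change, while Foreground-Mix diversifies object appearance at fixed background; how the paper selects partner samples, attention layers, and mixing ratios is not specified in the abstract.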