Abstract: Previous work on multi-modal data augmentation (DA) mixes modalities holistically to improve data efficiency on general downstream tasks. However, fusing modalities as a whole tends to degrade the cross-modal fine-grained cues that Text-based Person Search (TPS) relies on, limiting the effectiveness of such augmentation strategies. To address this issue, we propose a data augmentation method tailored to TPS, named the Partitional Semantic Mixup Method (PaSeMix), which establishes semantic correspondences between the visual and textual modalities to enable fine-grained multi-modal mixing. Specifically, PaSeMix leverages a semantic dictionary to map image parts to attribute words, and under this mapping performs local linear interpolation of image parts and local word replacement to generate augmented image-text pairs. PaSeMix endows the generated samples with tighter semantic relationships and encourages the model to learn the alignment of multi-modal data. In addition, we design a suite of cross-domain datasets by resplitting available datasets and conduct cross-domain retrieval experiments. The results demonstrate the adaptability of our approach across different distributions, improving Top-1 accuracy by 0.81%.
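The part-level mixing described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the fixed horizontal bands in `part_band`, the per-part attribute dictionaries standing in for captions, and the toy arrays are all illustrative assumptions. It shows the two operations the abstract names: linear interpolation of one semantically matched image part, and replacement of the corresponding attribute word.

```python
import numpy as np

def part_band(part, h):
    # Hypothetical layout: each body part occupies a fixed horizontal band.
    bands = {"head": (0, h // 3),
             "upper": (h // 3, 2 * h // 3),
             "lower": (2 * h // 3, h)}
    return bands[part]

def pasemix_sketch(img_a, img_b, attrs_a, attrs_b, part, lam):
    """Mix one semantically matched part of two images and swap the
    corresponding attribute word between their descriptions."""
    top, bot = part_band(part, img_a.shape[0])
    mixed = img_a.astype(float).copy()
    # Local linear interpolation of the chosen image part.
    mixed[top:bot] = lam * img_a[top:bot] + (1 - lam) * img_b[top:bot]
    # Local word replacement via the part -> attribute-word mapping.
    new_attrs = dict(attrs_a)
    new_attrs[part] = attrs_b[part]
    return mixed, new_attrs

# Toy 6x4 grayscale "images" with per-part attribute words as captions.
img_a = np.zeros((6, 4))
img_b = np.ones((6, 4))
attrs_a = {"head": "cap", "upper": "shirt", "lower": "jeans"}
attrs_b = {"head": "hood", "upper": "coat", "lower": "skirt"}

mixed, attrs = pasemix_sketch(img_a, img_b, attrs_a, attrs_b, "upper", lam=0.7)
# Only the "upper" band is interpolated; only the "upper" word is swapped.
```

In a real pipeline the bands would come from a human-parsing or part-detection model and the replacement would operate on caption tokens, but the structure of the augmentation is the same.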