Abstract: Generating pseudo-labels by clustering has proven effective for unsupervised domain adaptation (UDA) in person re-identification (re-ID). However, the pseudo-labels contain considerable noise, which hinders further improvement of model performance. Extracting representative features is key to solving this problem. In this paper, we propose the Part-Pixel Transformer with Smooth Alignment Fusion Network (PTFNet) to capture richer discriminative pedestrian features. Specifically, we design a Part-Pixel Transformer (PPformer) to model long-range dependencies among features; it horizontally splits the image features to obtain horizontal parts whose regions are more highly correlated, and then captures pixel-level interactions within each horizontal part. In addition, we propose a Smooth Alignment Fusion (SAF) module composed of a Smooth Alignment block (SA-Block) and a Cross-layer Fusion block (CF-Block). First, cross-layer features are smoothed by the SA-Block to reduce the semantic gap between features from different layers. They are then fed into the CF-Block, which aggregates low-level features carrying spatial information with high-level features carrying semantic information. Extensive experiments show that our proposed method significantly surpasses previous works on UDA person re-ID tasks.
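As a rough illustration of the part-pixel idea summarized above, the sketch below splits a backbone feature map into horizontal stripes, applies self-attention to the pixels inside each stripe, and then models interactions among the stripe descriptors. It is a minimal sketch under assumed settings (class names, channel dimension, stripe count, and pooling choice are all illustrative), not the authors' implementation of PPformer.

```python
import torch
import torch.nn as nn


class PartPixelAttention(nn.Module):
    """Illustrative part-pixel attention: pixel-level attention inside each
    horizontal stripe, then part-level attention across stripe descriptors."""

    def __init__(self, dim: int = 256, num_parts: int = 4, num_heads: int = 4):
        super().__init__()
        self.num_parts = num_parts
        # Pixel-level attention within each horizontal part.
        self.pixel_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Part-level attention across the stripe descriptors.
        self.part_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) backbone feature map; H must be divisible by num_parts.
        b, c, h, w = feat.shape
        ph = h // self.num_parts  # height of one horizontal stripe

        # Split into horizontal parts and flatten each part's pixels into tokens.
        parts = feat.view(b, c, self.num_parts, ph, w)                    # (B, C, P, ph, W)
        tokens = parts.permute(0, 2, 3, 4, 1).reshape(b * self.num_parts, ph * w, c)

        # Pixel-level interaction inside every stripe.
        tokens, _ = self.pixel_attn(tokens, tokens, tokens)

        # One descriptor per stripe (average pooling here), then part-level interaction.
        part_desc = tokens.mean(dim=1).view(b, self.num_parts, c)         # (B, P, C)
        part_desc, _ = self.part_attn(part_desc, part_desc, part_desc)
        return part_desc  # (B, P, C) part features for a re-ID head


if __name__ == "__main__":
    x = torch.randn(2, 256, 16, 8)        # toy backbone output
    print(PartPixelAttention()(x).shape)  # torch.Size([2, 4, 256])
```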