Abstract: Pedestrian attribute recognition is a sub-task of multi-label classification, and its performance is affected by the joint, coupled relationships among the many labels attached to each sample. This coupling impairs the model’s ability to judge individual labels and reduces recognition accuracy, both overall and per label, in unfamiliar scenes. In this paper, we propose a Disentangled Attribute Features Vision Transformer (DAF-ViT) model for pedestrian attribute recognition, which aims to enhance the discriminative ability of correlated features in pedestrian samples and to suppress that of irrelevant features. The decoupling model consists of three parallel self-attention modules: a correlation self-attention module (CSA) for extracting correlated attribute features, an exclusive self-attention module (ESA) for extracting mutually exclusive attribute features, and a random sequence self-attention module (RSA) for extracting unrelated attribute features. A multi-level result fusion module combines the outputs of these modules to produce the final predicted attributes. Compared with the baseline methods, our approach improves mean accuracy (mA) by 2.8% and 2.5% on the PETA and PA-100K datasets, respectively, raising the model’s average accuracy across attributes.
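For readers unfamiliar with the mA metric reported above, the following is a minimal sketch of how it could be computed. It assumes mA denotes the label-based mean accuracy commonly used on PETA and PA-100K, that is, the average over attributes of the mean of the positive and negative recall. The function name `mean_accuracy` and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_accuracy(y_true: Sequence[Sequence[int]],
                  y_pred: Sequence[Sequence[int]]) -> float:
    """Label-based mean accuracy (mA) over binary attributes.

    Assumed definition: for each attribute, average the recall on
    positive samples and the recall on negative samples, then average
    the result over all attributes.
    """
    num_attrs = len(y_true[0])
    per_attr = []
    for j in range(num_attrs):
        tp = tn = pos = neg = 0
        for t_row, p_row in zip(y_true, y_pred):
            if t_row[j] == 1:
                pos += 1
                tp += int(p_row[j] == 1)
            else:
                neg += 1
                tn += int(p_row[j] == 0)
        # An attribute with no positives (or no negatives) contributes 0
        # for that recall term; this convention is an assumption.
        tpr = tp / pos if pos else 0.0
        tnr = tn / neg if neg else 0.0
        per_attr.append((tpr + tnr) / 2)
    return sum(per_attr) / num_attrs


# Toy example: 4 pedestrians, 2 binary attributes (hypothetical data).
y_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 0], [0, 0]]
print(mean_accuracy(y_true, y_pred))
```

Because each attribute is weighted equally regardless of how often it occurs, this metric rewards models that handle rare attributes well, which is consistent with the abstract's emphasis on accuracy across individual attributes.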