Vision Transformer With Relation Exploration for Pedestrian Attribute Recognition

Published: 2025, Last Modified: 05 Nov 2025. IEEE Trans. Multim. 2025. License: CC BY-SA 4.0
Abstract: Pedestrian attribute recognition has achieved high accuracy by exploring the relations between image regions and attributes. However, existing methods typically adopt features directly extracted from the backbone or utilize a single structure (e.g., transformer) to explore the relations, leading to inefficient and incomplete relation mining. To overcome these limitations, this paper proposes a comprehensive relationship framework called Vision Transformer with Relation Exploration (ViT-RE) for pedestrian attribute recognition, which includes two novel modules, namely Attribute and Contextual Feature Projection (ACFP) and Relation Exploration Module (REM). In ACFP, attribute-specific features and context-aware features are learned individually to capture discriminative information tailored for attributes and image regions, respectively. Then, REM employs Graph Convolutional Network (GCN) Blocks and Transformer Blocks to concurrently explore attribute, contextual, and attribute-contextual relations. To enable fine-grained relation mining, a Dynamic Adjacency Module (DAM) is further proposed to construct an instance-wise adjacency matrix for the GCN Block. Equipped with comprehensive relation information, ViT-RE achieves promising performance on three popular benchmarks: the PETA, RAP, and PA-100K datasets. Moreover, ViT-RE achieved first place in the WACV 2023 UPAR Challenge.
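
The idea of a GCN block driven by an instance-wise adjacency matrix can be illustrated with a small sketch. The abstract does not say how DAM computes its adjacency, so the version below simply assumes a row-wise softmax over scaled dot-product similarity between attribute features; the function names (`dynamic_adjacency`, `gcn_layer`), the toy features, and the identity weight matrix are all hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def dynamic_adjacency(feats):
    """Build an adjacency matrix from the input itself (one per instance).

    ASSUMPTION: row-softmax of scaled dot-product similarity. The paper's
    DAM may be defined differently.
    """
    dim = len(feats[0])
    transposed = [list(col) for col in zip(*feats)]
    scores = matmul(feats, transposed)
    return [softmax([s / math.sqrt(dim) for s in row]) for row in scores]


def gcn_layer(feats, weight):
    """One graph-convolution step: ReLU(A X W), with A derived from X."""
    adj = dynamic_adjacency(feats)
    out = matmul(matmul(adj, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Toy input: 3 attribute nodes, each with a 2-dimensional feature vector.
attr_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weight matrix

adj = dynamic_adjacency(attr_feats)
updated = gcn_layer(attr_feats, identity)
```

Because the adjacency is recomputed from each input's own features, two different pedestrian images would yield two different graphs over the same attribute nodes, which is the contrast with a single fixed adjacency shared across the dataset.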