Keywords: Pedestrian attribute recognition, Parameter-Efficient Fine-Tuning, Mixture of Experts
Abstract: Pedestrian attribute recognition aims to identify multiple semantic attributes of individuals from visual data, a task critical for surveillance applications. However, existing methods often overlook the heterogeneity of pedestrian attributes and lack mechanisms for effectively modeling inter-attribute relationships. This paper proposes VLAAR, a parameter-efficient fine-tuning method for pedestrian attribute recognition built on the mixture-of-experts framework. Starting from a pre-trained CLIP model, our approach adds lightweight expert modules that form a pool of specialized networks. At its core, a dual-input routing mechanism jointly processes visual features and semantic cues derived from natural language prompts to guide expert selection. This dynamic routing directs complex attribute information to the most suitable experts while preserving computational efficiency. Extensive evaluations on image and video benchmarks demonstrate state-of-the-art performance for multi-label attribute recognition in surveillance and re-identification systems.
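The dual-input routing idea in the abstract can be illustrated with a minimal sketch. This is not the VLAAR implementation: the abstract does not specify the router or expert design, so the concatenation of visual and text features, the top-k selection, the low-rank expert shape, the residual connection, and every name below (`LowRankExpert`, `DualInputRouter`, `moe_forward`) are assumptions made purely for illustration, with random vectors standing in for CLIP features.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LowRankExpert:
    """Hypothetical lightweight expert: x -> up(relu(down(x)))."""

    def __init__(self, dim, rank, rng):
        self.down = rand_matrix(rank, dim, rng)
        self.up = rand_matrix(dim, rank, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, hidden)


class DualInputRouter:
    """Assumed router: a linear gate over concatenated visual and text features."""

    def __init__(self, vis_dim, txt_dim, num_experts, rng):
        self.gate = rand_matrix(num_experts, vis_dim + txt_dim, rng)

    def __call__(self, visual, text, top_k):
        logits = matvec(self.gate, visual + text)  # list concatenation
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
        weights = softmax([logits[i] for i in top])  # renormalize over chosen experts
        return list(zip(top, weights))


def moe_forward(visual, text, experts, router, top_k=2):
    """Route on both inputs, then add the weighted expert outputs to the visual feature."""
    routing = router(visual, text, top_k)
    out = list(visual)  # residual path (assumed)
    for idx, weight in routing:
        expert_out = experts[idx](visual)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, routing


rng = random.Random(0)
dim, txt_dim, num_experts = 8, 4, 4
experts = [LowRankExpert(dim, 2, rng) for _ in range(num_experts)]
router = DualInputRouter(dim, txt_dim, num_experts, rng)
visual = [rng.uniform(-1, 1) for _ in range(dim)]    # stand-in for a CLIP image feature
text = [rng.uniform(-1, 1) for _ in range(txt_dim)]  # stand-in for a prompt embedding
output, routing = moe_forward(visual, text, experts, router)
```

Only the selected experts are evaluated in this sketch, which is one common way a sparse mixture of experts keeps its per-input cost low as the expert pool grows.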
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5055