VLAAR: Vision-Language Attribute-Aware Router for Pedestrian Attribute Recognition

Lam Nguyen; Minh Kha Do

14 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | Readers: Everyone | License: CC BY 4.0

Keywords: Pedestrian attribute recognition, Parameter-efficient fine-tuning, Mixture of experts

Abstract: Pedestrian attribute recognition aims to identify multiple semantic attributes of individuals from visual data, a task critical for surveillance applications. However, existing methods often overlook the heterogeneity of pedestrian attributes and lack mechanisms for effectively modeling inter-attribute relationships. This paper proposes VLAAR, a parameter-efficient fine-tuning method for pedestrian attribute recognition that leverages the mixture-of-experts framework. Building upon a pre-trained CLIP model, our approach employs lightweight expert modules, forming a pool of specialized networks. At its core, our dual-input routing mechanism concurrently processes visual features alongside semantic cues derived from natural language prompts, guiding expert selection effectively. This dynamic routing facilitates the optimal allocation and efficient processing of complex attribute information while preserving computational efficiency. Extensive evaluations on image and video benchmarks demonstrate state-of-the-art performance for multi-label attribute recognition in surveillance and re-identification systems.
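
The dual-input routing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the router architecture, the form of the experts, or how many experts are selected, so everything below is an assumption made for illustration: the class name `DualInputRouter`, the linear gate over concatenated visual and text features, the low-rank (down/up projection) experts, the top-k selection, and the residual connection are all hypothetical and are not taken from the paper. Random vectors stand in for CLIP image and prompt embeddings.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DualInputRouter:
    """Hypothetical router: scores experts from visual AND text features."""

    def __init__(self, dim, n_experts, rank=2, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Assumption: a single linear gate over the concatenated inputs.
        self.gate = rand_matrix(n_experts, 2 * dim, rng)
        # Assumption: each lightweight expert is a low-rank adapter (down, up).
        self.experts = [
            (rand_matrix(rank, dim, rng), rand_matrix(dim, rank, rng))
            for _ in range(n_experts)
        ]

    def route(self, visual, text):
        """Return {expert_index: weight} for the top-k experts."""
        logits = matvec(self.gate, visual + text)  # list concatenation
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = top[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return dict(zip(top, weights))

    def forward(self, visual, text):
        """Add the weighted outputs of the selected experts to the visual feature."""
        out = list(visual)  # residual path keeps the frozen backbone feature
        for idx, w in self.route(visual, text).items():
            down, up = self.experts[idx]
            delta = matvec(up, matvec(down, visual))
            out = [o + w * d for o, d in zip(out, delta)]
        return out


# Toy usage: 4-dimensional stand-ins for an image feature and a prompt feature.
router = DualInputRouter(dim=4, n_experts=3)
visual = [0.5, -0.2, 0.1, 0.9]
text = [0.3, 0.3, -0.7, 0.0]
weights = router.route(visual, text)
output = router.forward(visual, text)
```

In this sketch only the gate and the small expert matrices would be trainable, which is one common way to keep fine-tuning parameter-efficient; whether VLAAR follows this exact design is not stated in the abstract.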

Supplementary Material: pdf

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5055
