CLIPping Imbalances: A Novel Evaluation Baseline and PEARL Dataset for Pedestrian Attribute Recognition
Abstract: Pedestrian Attribute Recognition (PAR) is a fundamental task in computer vision and is crucial for enhancing security systems, as it enables the precise identification and characterization of pedestrian attributes. However, current PAR datasets fail to represent a wide range of attributes adequately, which limits the effectiveness of existing PAR methods in real-world scenarios.
Addressing this limitation, this paper introduces PEARL, a comprehensive dataset comprising diverse pedestrian images annotated with 146 attributes and sourced from surveillance videos across twelve countries. This paper also formulates image-based PAR with a language-image fusion strategy and establishes CLIP as a new evaluation baseline. Specifically, we leverage textual information by transforming sets of attributes into meaningful sentences. To address the inherent data imbalance in PAR, we provide three types of prompt settings to optimize the training of the CLIP model. Our evaluation encompasses a thorough assessment of the proposed baseline model across various datasets, including the PEARL dataset as well as established PAR benchmarks such as PA100K, RAP, and PETA.
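As a minimal sketch of the attribute-to-sentence prompting idea described in the abstract, assuming the public OpenAI CLIP package: the attribute names, prompt template, and image path below are illustrative placeholders, not the paper's actual 146-attribute vocabulary or its three prompt settings.

import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def attributes_to_sentence(attrs):
    # Turn a set of attribute labels into one descriptive prompt.
    return "A photo of a pedestrian who is " + ", ".join(attrs) + "."

# Hypothetical prompts built from two contrasting attribute sets.
prompts = [
    attributes_to_sentence(["female", "wearing a hat"]),
    attributes_to_sentence(["male", "not wearing a hat"]),
]

image = preprocess(Image.open("pedestrian.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)  # image-text similarity logits
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # probability the image matches each attribute sentence

The sketch is zero-shot; the proposed baseline instead fine-tunes CLIP on such sentences under the imbalance-aware prompt settings described above.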