FIRM: Fusion-Injected Residual Memory Brings Token-Level Alignment to Unsupervised VI-ReID

Published: 01 Sept 2025, Last Modified: 18 Nov 2025 · ACML 2025 Conference Track · CC BY 4.0
Abstract: Unsupervised visible-infrared person re-identification (VI-ReID) presents unique challenges due to severe modality discrepancies, including heterogeneous appearance gaps, semantic granularity mismatches, and the pseudo-label noise amplification intrinsic to label-free scenarios. We distill these challenges into two core problems: fine-grained semantic alignment, which requires explicit token-level cross-modal feature fusion, and memory fragmentation caused by noisy pseudo-label propagation. To address these issues, we propose Fusion-Injected Residual Memory (FIRM), a unified framework with two components. Vision–Semantic Prompt Fusion (VSPF) injects multi-scale textual cues derived from CLIP and large language models into multiple layers of a vision backbone for token-wise semantic alignment, while Evolving Multi-view Cluster Memory (EMCM) employs optimal transport–guided clustering and dynamic prototype maintenance to ensure long-term identity consistency. The framework is optimized end-to-end using an optimal transport–weighted InfoNCE loss, a multi-layer alignment regularizer, and geometric cluster regularization, all without reliance on manual annotations. Extensive experiments on benchmark VI-ReID datasets demonstrate that the proposed method substantially advances unsupervised cross-modal retrieval performance, achieving new state-of-the-art results. Ablation studies further verify the independent and synergistic effectiveness of both modules in overcoming the identified core challenges.
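The abstract does not spell out the optimal transport–weighted InfoNCE objective, but its general shape can be sketched. Below is a minimal, illustrative Python implementation of an InfoNCE loss in which each anchor's contribution is scaled by a per-sample weight; here `weights` merely stands in for whatever OT-derived transport weights the paper uses, and all function and argument names are assumptions, not the authors' API.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def weighted_infonce(queries, keys, weights, temperature=0.07):
    """Weight-modulated InfoNCE (sketch).

    queries[i] is assumed to match keys[i]; weights[i] scales that
    anchor's loss term (standing in for OT-derived pair confidence).
    """
    total = 0.0
    for i, q in enumerate(queries):
        # Similarity logits of query i against every key.
        logits = [cosine(q, k) / temperature for k in keys]
        log_denom = math.log(sum(math.exp(z) for z in logits))
        # Standard InfoNCE term, down/up-weighted per anchor.
        total += -weights[i] * (logits[i] - log_denom)
    return total / len(queries)
```

With matching query/key pairs the loss is near zero, and it grows when the positives are mismatched; confident pairs (larger weights) dominate the gradient, which is the intuition behind weighting InfoNCE by transport mass.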
Submission Number: 270