Scale-Aware Vision-Language Adaptation for Extreme Far-Distance Video Person Re-identification

Published: 11 May 2026, Last Modified: 11 May 2026
Venue: AERO-HPR 2026 Poster
License: CC BY 4.0
Track: Proceedings Track
Keywords: Person Re-identification, Aerial–Ground ReID, Cross-camera matching
Abstract: Extreme far-distance video person re-identification (ReID) is particularly challenging due to scale compression, resolution degradation, motion blur, and aerial–ground viewpoint mismatch. As camera altitude and subject distance increase, models trained on close-range imagery degrade significantly. In this work, we investigate how large-scale vision-language models can be adapted to operate reliably under these conditions. Starting from a CLIP-based baseline, we upgrade the visual backbone from ViT-B/16 to ViT-L/14 and introduce backbone-aware selective fine-tuning to stabilize adaptation of the larger transformer. To address noisy and low-resolution tracklets, we incorporate a lightweight temporal attention pooling mechanism that suppresses degraded frames and emphasizes informative observations. We retain adapter-based and prompt-conditioned cross-view learning to mitigate aerial–ground domain shifts, and further refine retrieval using improved optimization and k-reciprocal re-ranking. Experiments on the DetReIDX stress-test benchmark show that our approach achieves an average mAP of $35.73$ across aerial–ground (A2G), ground–aerial (G2A), and aerial–aerial (A2A) protocols, improving substantially over the existing CLIP-based baseline. These results show that large-scale vision-language backbones, when combined with stability-focused adaptation, significantly enhance robustness in extreme far-distance video person ReID.
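To make the temporal aggregation step concrete, the sketch below illustrates one plausible form of the lightweight temporal attention pooling described in the abstract: per-frame embeddings from the visual backbone are scored by a small MLP, normalized over time with a softmax, and combined into a single tracklet descriptor so that degraded frames receive lower weight. This is a minimal illustration, not the authors' implementation; the class name `TemporalAttentionPooling`, the hidden dimension, and the scoring MLP are assumptions.

```python
# Minimal sketch (assumed design, not the authors' code): lightweight temporal
# attention pooling that aggregates per-frame features of a tracklet into a
# single clip-level descriptor, down-weighting degraded frames.
import torch
import torch.nn as nn


class TemporalAttentionPooling(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Small MLP that scores each frame's informativeness (hypothetical form).
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame embeddings from the visual backbone.
        scores = self.score(frame_feats)             # (B, T, 1) frame scores
        weights = torch.softmax(scores, dim=1)       # attention weights over time
        pooled = (weights * frame_feats).sum(dim=1)  # (B, D) tracklet descriptor
        return pooled


if __name__ == "__main__":
    # Example: pool an 8-frame tracklet of 768-D embeddings (CLIP ViT-L/14 output dim).
    pool = TemporalAttentionPooling(feat_dim=768)
    feats = torch.randn(2, 8, 768)
    print(pool(feats).shape)  # torch.Size([2, 768])
```

The pooled descriptor can then be used for retrieval exactly like a single-frame embedding, which is what makes this kind of pooling compatible with a CLIP-style matching pipeline and with k-reciprocal re-ranking downstream.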
Supplementary Material: pdf
Submission Number: 3