AEA-FIRM: Adaptive Elastic Alignment With Fine-Grained Representation Mining for Text-Based Aerial Pedestrian Retrieval
Abstract: Autonomous aerial vehicles (AAVs) have garnered significant attention due to their operational flexibility, enabling expanded application scenarios across diverse fields. The Text-Based Pedestrian Retrieval (TBPR) task aims to identify corresponding images from textual descriptions, yet existing research has primarily focused on ground-level views. To broaden the applicability of TBPR systems, we introduce aerial-view analysis and propose a novel Text-Based Aerial Pedestrian Retrieval (TBAPR) task. This task introduces unique challenges, particularly the dual gaps in cross-view (aerial vs. ground) and cross-modal (text vs. image) matching, which are more complex than traditional TBPR or aerial-ground pedestrian understanding tasks. To address these challenges, we propose an Adaptive Elastic Alignment Network with FIne-Grained Representation Mining (AEA-FIRM). Our framework tackles the cross-view gap through an AEA loss that adaptively prioritizes critical semantic features while dynamically aligning textual and aerial semantics under challenging conditions. Concurrently, the FIRM module refines visual-linguistic representations by mining fine-grained pedestrian attributes and explicitly textualizing them for cross-modal matching verification. Extensive experiments demonstrate that AEA-FIRM achieves state-of-the-art performance, outperforming existing TBPR methods by 4.87% in Rank-1 accuracy. Our code and dataset are available at https://github.com/xbdxwyh/AEA-FIRM-main.git
External IDs: doi:10.1109/tcsvt.2025.3586601