Dual-Stream Multimodal Person Re-Identification Under Overhead Surveillance: RGB and Depth Perspectives

Md Rashidunnabi; Hugo Proenca; João C. Neves; Vasco Lopes; Kailash A. Hambarde

Published: 11 May 2026, Last Modified: 11 May 2026

Venue: AERO-HPR 2026 Poster

License: CC BY 4.0

Track: Non-Proceedings Track

Keywords: Person Re-Identification, RGB-D, Overhead Surveillance, Multimodal Learning, Temporal Attention, Cross-Modal Alignment

TL;DR: OV-ReID is a dual-stream RGB-depth person re-identification model for overhead surveillance that jointly learns modality-specific and shared features, outperforming the TVRID baseline on both RGB and depth retrieval.

Abstract: Identifying the same person across multiple cameras is a core task in video surveillance. The task becomes much harder when cameras are mounted overhead (pointing downward, upward, or even inverted), because a person’s appearance changes significantly with viewing angle. We present OV-ReID (Overhead-View Re-Identification), a system that recognizes people from both color (RGB) and depth video captured by such cameras. OV-ReID uses two parallel neural networks, one per sensor type, trained jointly so that both produce compatible identity representations. We evaluate the method on TVRID, the benchmark dataset of the ICPR 2026 Top-View RGB-Depth Re-ID Challenge, which contains 62 people recorded from four camera angles, including upward- and downward-facing setups that closely resemble aerial surveillance conditions. OV-ReID achieves 99.5% mAP on RGB and 94.3% mAP on depth, outperforming the official competition baseline by 18.9 mAP points on RGB (from 80.6%) and 46.9 mAP points on depth (from 47.4%). In the most aerial-like setting, where matching is performed between an upward-facing and a downward-facing camera, the system achieves 100% recognition accuracy on RGB. The code and models for both the RGB and depth versions are publicly available on GitHub.
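For readers unfamiliar with the mAP figures quoted above, the following is a minimal sketch of how mean average precision is commonly computed for retrieval. It assumes the standard definition (precision averaged over the ranks at which correct matches appear, then averaged over queries). The function names and toy identity labels are hypothetical, and the exact TVRID evaluation protocol (for example, any filtering of same-camera gallery items) is not described here and may differ.

```python
from typing import Sequence


def average_precision(ranked_ids: Sequence[int], query_id: int) -> float:
    """AP for one query: mean of precision@rank over ranks holding a correct match.

    ranked_ids is the gallery identity list sorted from most to least similar.
    (Hypothetical helper, standard textbook definition.)
    """
    hits = 0
    precision_sum = 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / rank
    # A query with no correct match in the gallery contributes 0 here (an assumption).
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    rankings: Sequence[Sequence[int]], query_ids: Sequence[int]
) -> float:
    """mAP: the mean of per-query AP values."""
    aps = [average_precision(r, q) for r, q in zip(rankings, query_ids)]
    return sum(aps) / len(aps) if aps else 0.0


# Toy example with made-up identity labels (not TVRID data).
toy_rankings = [[1, 2, 1, 3], [5, 7, 7]]
toy_queries = [1, 7]
toy_map = mean_average_precision(toy_rankings, toy_queries)
```

Under this definition, the reported improvements are simple differences in percentage points: 99.5 - 80.6 = 18.9 on RGB and 94.3 - 47.4 = 46.9 on depth.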

Submission Number: 14
