Track: Non-Proceedings Track
Keywords: Aerial activity recognition, multimodal LLM, UAV metadata, video understanding, human behavior analysis, sensor fusion
TL;DR: We propose a multimodal LLM that fuses aerial video with UAV metadata, improving human activity recognition by up to 3.6% F1 and generating interpretable descriptions, highlighting the benefits of context-aware reasoning in aerial surveillance.
Abstract: We propose a multimodal LLM framework for aerial human activity recognition that fuses visual features with UAV sensor metadata (GPS, altitude, time). On publicly available aerial benchmarks, including the large-scale UAV-Human dataset, our method achieves up to 3.6\% absolute F1 improvement over strong vision-only baselines while generating interpretable natural language descriptions. An ablation over metadata fields shows temporal context (+1.6\% F1) and altitude cues (+1.2\% F1) to be the most impactful. The 7B-parameter model requires 2.8s per clip on an A100 GPU, and 8-bit quantization reduces memory by 45\% at a 1.7\% accuracy cost. We discuss compute–accuracy trade-offs and ethical considerations for surveillance applications.
Submission Number: 2