Track: Non-Proceedings Track
Keywords: Aerial activity recognition, multimodal LLM, UAV metadata, video understanding, human behavior analysis, sensor fusion
TL;DR: We propose a multimodal LLM that fuses aerial video with UAV metadata, improving human activity recognition by up to 3.6% F1 and generating interpretable descriptions, highlighting the benefits of context-aware reasoning in aerial surveillance.
Abstract: We propose a multimodal LLM framework for aerial human activity recognition that fuses visual features with UAV sensor metadata (GPS, altitude, time). On publicly available aerial benchmarks, including the large-scale UAV-Human dataset, our method achieves up to 3.6\% absolute F1 improvement over strong vision-only baselines while generating interpretable natural language descriptions. An ablation over metadata fields shows temporal context (+1.6\% F1) and altitude cues (+1.2\% F1) to be the most impactful. The 7B-parameter model requires 2.8s per clip on an A100 GPU, and 8-bit quantization reduces memory by 45\% at a 1.7\% accuracy cost. We discuss compute–accuracy trade-offs and ethical considerations for surveillance applications.
Submission Number: 2