Keywords: UAV Control, Reinforcement Learning, Open Vocabulary Detection, Vision Language Models
Abstract: Autonomous UAVs operating in dynamic environments face significant challenges when tasked with tracking arbitrary targets described in natural language, particularly in unstructured scenarios where traditional closed-world tracking systems fail. This work presents a novel open-vocabulary UAV target tracking framework that integrates vision-language models with classical tracking algorithms to enable real-time reactive control in mountainous terrain. Our approach combines OWL-ViT for zero-shot object detection, CSRT for efficient tracking, and a hybrid control architecture that uses gimbal-based localization for distant targets and reinforcement learning (RL)-assisted visual servoing for precise following. The RL-adapted PD controller performs robustly across varying target velocities where traditional PID controllers fail, addressing the need for real-time reactivity and smooth trajectory generation in unpredictable conditions. We validate the framework in AirSim's mountainous terrain with configurable vehicle dynamics, demonstrating stable tracking despite challenging viewing angles and environmental disturbances. While CSRT tracking remains stable, OWL-ViT detection occasionally fails in complex terrain, highlighting the ongoing challenges of perception in such settings. The modular architecture enables natural language target specification without predefined object classes, contributing to more adaptable and trustworthy robotic systems for search-and-rescue and surveillance applications.
Submission Number: 32