Tri-Agent Driving: Learning to Coordinate Agents via Scenario Complexity Representation for Efficient Autonomous Driving
Keywords: Scenario Complexity-Aware, Vision-Language Models, Autonomous Driving
Abstract: End-to-End (E2E) autonomous driving systems face a fundamental dilemma: fast traditional models offer low latency but struggle with complex and ambiguous scenarios, while Vision-Language Model (VLM)-based systems provide powerful contextual understanding at the cost of high computational overhead. Instead of pursuing a single faster or more powerful model, we present Tri-Agent Driving (TAD), a dynamic framework that learns a scenario complexity representation directly from raw multi-view camera inputs and uses it to select the most appropriate agent on the fly. This learned representation serves as a routing signal that enables real-time activation of the optimal agent, balancing computational efficiency and reasoning depth on demand. TAD integrates three complementary agents: a Fast Agent optimized for low-complexity, routine scenarios; a Smart Agent for medium-complexity scenes; and a Deep Thinking Agent enhanced with Chain-of-Thought (CoT) reasoning for high-complexity corner cases. The core of TAD is the trainable Agent Coordination module, which proactively predicts scenario complexity and triggers agent switching without human intervention. On a challenging hybrid test set spanning diverse traffic conditions, TAD achieves state-of-the-art trajectory prediction while reducing average inference latency by 26\% (4.2 s vs. 5.7 s) and GPU memory consumption by 30\% (15.4 GB vs. 22 GB) compared to the strongest VLM-based model. This ``fast when possible, deep when necessary'' paradigm establishes a new standard for efficient, robust, and adaptive end-to-end autonomous driving.
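The routing idea described in the abstract can be illustrated with a minimal sketch. It assumes the Agent Coordination module emits a scalar complexity score in [0, 1]; that score, the fixed thresholds `low` and `high`, and the names `Agent`, `select_agent`, and `relative_reduction` are hypothetical stand-ins, not the authors' implementation, which learns the routing rather than thresholding by hand. The helper at the end only rechecks the reported percentage reductions from the figures quoted in the abstract.

```python
from enum import Enum


class Agent(Enum):
    """The three agents named in the abstract."""
    FAST = "fast"
    SMART = "smart"
    DEEP_THINKING = "deep_thinking"


def select_agent(complexity: float, low: float = 0.33, high: float = 0.66) -> Agent:
    """Map a complexity score to an agent.

    Assumption: complexity is a scalar in [0, 1]. The thresholds are
    illustrative placeholders; TAD learns this decision instead.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    if complexity < low:
        return Agent.FAST            # routine, low-complexity scene
    if complexity < high:
        return Agent.SMART           # medium-complexity scene
    return Agent.DEEP_THINKING       # high-complexity corner case


def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


if __name__ == "__main__":
    for score in (0.10, 0.50, 0.90):
        print(score, select_agent(score).value)
    # Figures quoted in the abstract: latency in seconds, memory in GB.
    print(round(100 * relative_reduction(5.7, 4.2)))
    print(round(100 * relative_reduction(22.0, 15.4)))
```

Under this sketch the expensive agent runs only when the score crosses the upper threshold, which is the mechanism by which average latency and memory can fall below those of an always-on VLM.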
Primary Area: applications to robotics, autonomy, planning
Submission Number: 1209