Physics-Guided Multimodal Multi-Agent Learning for Intelligent Transportation Systems

Zhen Tian; Yaqiong Zhang; Zhihao Lin; Fujiang Yuan; Yijun Lu; Wangjie lang; Xinyu Wang; Ning Lyu; Zhiguo Tao; Kaijie Chen; Aaron Wang

Published: 01 Mar 2026, Last Modified: 24 Apr 2026 | ICLR 2026 AIWILD | Everyone | CC BY 4.0

Keywords: Multi-Agent Systems, Agentic Coordination, Hierarchical Agent Architectures, Physics-Guided Learning, Reinforcement Learning for Safety-Critical Control, Multimodal Agent Perception, Real-World Autonomous Systems

TL;DR: We present a physics-guided hierarchical multi-agent system that combines LLM-based semantic coordination with safe residual learning to enable robust decision-making in safety-critical real-world environments.

Abstract: Intelligent transportation systems (ITS) require reliable coordination among multiple vehicles operating under heterogeneous behaviors and time-varying traffic conditions. Prior approaches face a recurring trade-off: physics-based controllers provide interpretability and constraint handling but can be brittle under model mismatch, whereas data-driven policies adapt to complex scenarios but often lack safety guarantees and transparent decision logic. We propose a hierarchical, physics-guided framework that separates semantic coordination from continuous control execution. At the regional level, a large language model (LLM) generates discrete, human-interpretable coordination directives (e.g., yielding priority and target gaps) from multimodal observations. At the vehicle level, each directive is realized by a physics-informed controller augmented with a learned residual policy, where the residual is constrained and safety-filtered to preserve feasibility and closed-loop robustness. Multimodal fusion via vision–language models (VLMs) supports context-aware coordination by combining visual cues with textual traffic descriptors and temporal signals. In highway merging simulations, the proposed framework improves traffic throughput by 23% and reduces collision rates by 31% relative to classical and learning-based baselines, indicating that semantic coordination and physics-grounded execution can be combined without sacrificing safety-critical control requirements.
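The vehicle-level pipeline described in the abstract (physics-informed controller, bounded learned residual, safety filter) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the Intelligent Driver Model (IDM) stands in for the physics-informed controller, the `Directive` fields and the way they map onto IDM parameters are hypothetical, the residual is a plain number standing in for a learned policy output, and the safety filter is a simple worst-case braking-distance check rather than the authors' method.

```python
import math
from dataclasses import dataclass


@dataclass
class Directive:
    """Hypothetical discrete directive from the regional coordinator."""
    yield_priority: bool   # True if this vehicle should yield
    target_gap_m: float    # desired standstill gap to the leader, in metres


def idm_accel(v, gap, dv, v0=30.0, T=1.5, a_max=1.5, b=2.0, s0=2.0):
    """Intelligent Driver Model acceleration (assumed physics prior).

    v: ego speed, gap: distance to leader, dv: v - v_lead (approach rate).
    """
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** 4 - (s_star / max(gap, 0.1)) ** 2)


def safety_filter(a_cmd, v, gap, v_lead, dt=0.1, b_max=6.0, s_min=2.0,
                  a_lim=(-6.0, 2.0)):
    """Clip the command, then fall back to full braking if unsafe.

    Worst case assumed: the leader brakes at b_max immediately, while the
    ego applies the command for one step dt and then brakes at b_max.
    """
    a = min(max(a_cmd, a_lim[0]), a_lim[1])
    v_next = max(0.0, v + a * dt)
    # Upper bound on distance covered during the step (conservative for a < 0).
    travel = v * dt + 0.5 * max(a, 0.0) * dt * dt
    ego_stop = travel + v_next ** 2 / (2.0 * b_max)
    lead_stop = v_lead ** 2 / (2.0 * b_max)
    if gap + lead_stop - ego_stop < s_min:
        return a_lim[0]  # unsafe: override with maximum braking
    return a


def control_step(directive, v, gap, v_lead, residual, r_max=0.5):
    """One control step: physics prior + bounded residual + safety filter."""
    # Hypothetical mapping: yielding vehicles keep a 50% longer time headway.
    T = 1.5 * (1.5 if directive.yield_priority else 1.0)
    a_phys = idm_accel(v, gap, v - v_lead, T=T, s0=directive.target_gap_m)
    a_res = min(max(residual, -r_max), r_max)  # constrain the learned residual
    return safety_filter(a_phys + a_res, v, gap, v_lead)
```

In this sketch the residual can shift the physics-based command by at most `r_max`, and the filter has the final say, so a poorly trained residual cannot push the vehicle past the assumed braking-distance constraint. A call such as `control_step(Directive(False, 10.0), v=20.0, gap=100.0, v_lead=20.0, residual=5.0)` returns the IDM acceleration plus the clipped residual of 0.5, since the filter finds the state safe.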

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 3
