ViRL-TSC: Enhancing Reinforcement Learning with Vision-Language Models for Context-Aware Traffic Signal Control

ICLR 2026 Conference Submission 9279 Authors

17 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning, Vision-Language Models, Traffic Signal Control
Abstract: In real-world urban environments, traffic signal control (TSC) must maintain stability and efficiency under highly uncertain and dynamically changing traffic conditions. Although reinforcement learning (RL) has shown strong adaptability in dynamic environments, existing methods still depend on predefined state spaces and cannot directly perceive the environment. Consequently, they fail to exploit visual semantic information, which restricts their ability to generalize to traffic conditions unseen during training or evolving at deployment. To overcome these limitations, we introduce ViRL-TSC, a unified framework that integrates RL with Vision–Language Models (VLMs). A Foundation Model-Driven Visual Reasoning Engine (FM-VRE) fuses visual inputs with structured information matrices to generate high-level multimodal semantic representations of intersections. These representations are then processed by a Foundation Model-Driven Decision Evaluation Engine (FM-DEE), which evaluates them against the RL agent's proposed actions. The RL policy ensures efficient control in scenarios encountered during training, while the VLM leverages logical reasoning and contextual analysis to handle rare events beyond the scope of RL training. By combining RL's task-specific policy optimization with the VLM's rich semantic understanding, ViRL-TSC maintains high efficiency during routine operations and selectively intervenes to improve robustness under long-tail traffic conditions.
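To make the division of labor concrete, the sketch below shows one plausible reading of the control loop: the RL policy proposes a signal phase, the FM-VRE turns a camera frame plus a structured information matrix into a semantic state, and the FM-DEE keeps the RL action in routine traffic but overrides it when the scene looks like a rare event. The class and method names, the information-matrix layout, and the rare-event threshold are all illustrative assumptions; the abstract does not specify these interfaces.

```python
# Hypothetical sketch of the ViRL-TSC control loop described above. The names
# FM-VRE and FM-DEE come from the abstract; every class, method, field, and
# threshold below is an illustrative assumption, not the paper's implementation.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SemanticState:
    """High-level multimodal representation produced by the FM-VRE."""
    description: str      # VLM-style scene summary (assumed format)
    is_rare_event: bool   # e.g., blocked lane or abnormal queue (assumed flag)

class VisualReasoningEngine:
    """FM-VRE: fuses visual input with a structured information matrix."""
    def perceive(self, frame, info_matrix: Dict[str, List[int]]) -> SemanticState:
        # Placeholder heuristic standing in for VLM scene reasoning.
        max_queue = max(info_matrix["queue_lengths"])
        return SemanticState(
            description=f"longest queue: {max_queue} vehicles",
            is_rare_event=max_queue > 30,  # assumed long-tail threshold
        )

class DecisionEvaluationEngine:
    """FM-DEE: evaluates the RL agent's proposed action in context."""
    def evaluate(self, state: SemanticState, rl_action: int,
                 info_matrix: Dict[str, List[int]]) -> int:
        if state.is_rare_event:
            # Placeholder for VLM logical reasoning: serve the heaviest approach
            # (assumes one phase per approach for simplicity).
            return max(range(len(info_matrix["queue_lengths"])),
                       key=lambda phase: info_matrix["queue_lengths"][phase])
        return rl_action  # routine traffic: keep the efficient RL action

def control_step(rl_policy: Callable[[List[float]], int],
                 vre: VisualReasoningEngine, dee: DecisionEvaluationEngine,
                 frame, info_matrix: Dict[str, List[int]],
                 obs: List[float]) -> int:
    """One decision: RL proposes a phase, the VLM selectively intervenes."""
    rl_action = rl_policy(obs)
    state = vre.perceive(frame, info_matrix)
    return dee.evaluate(state, rl_action, info_matrix)

# Toy usage with a stub policy and a four-phase intersection.
if __name__ == "__main__":
    policy = lambda obs: 1  # stand-in for a trained RL policy
    info = {"queue_lengths": [4, 7, 35, 2]}  # phase-indexed queues (assumed)
    action = control_step(policy, VisualReasoningEngine(),
                          DecisionEvaluationEngine(), frame=None,
                          info_matrix=info, obs=[0.0])
    print(action)  # -> 2: rare-event override toward the heaviest queue
```

The guard in `evaluate` mirrors the abstract's claim of selective intervention: the VLM pathway engages only under long-tail conditions, so routine control stays as cheap as pure RL.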
Primary Area: reinforcement learning
Submission Number: 9279