Orchestrating LLMs as Hierarchical Multi-Agent Reinforcement Learning System for Automotive Software Development

Published: 25 May 2026, Last Modified: 27 May 2026, Venue: DEMO 2026 Poster, License: CC BY 4.0
Keywords: Multi-Agent Reinforcement Learning, Hierarchical Reinforcement Learning, Large Language Models, LLM Agents, Agentic Software Engineering, Supervised Fine Tuning, Offline-to-Online Reinforcement Learning, Shadow Mode, Cyber-Physical Systems, Human-in-the-Loop Verification, Automotive Software Development
TL;DR: AutoEvolve is a hierarchical multi-agent RL framework that uses LLMs to automate automotive software development. It combines offline fine-tuning with online RL to diagnose, fix, and optimize firmware, cutting active development cycle time from ~5 days to under 6 hours.
Abstract: Software-defined vehicles depend on firmware that must evolve continuously and safely, yet general-purpose LLM coding agents lack the architectural mechanisms required for safety-critical cyber-physical systems. We introduce **AutoEvolve**, a Hierarchical Multi-Agent Reinforcement Learning (H-MARL) framework with three contributions: (i) jointly learned Orchestrator and sub-agent (Data, Requirements, Code) policies, where an *adversarial* Requirements Agent rejects unsafe candidates rather than merely critiquing them; (ii) an offline-to-online curriculum that initializes via SFT on historical development trajectories and refines via PPO in a *shadow-mode* deployment running parallel to human engineers, treating their commits as delayed supervision; and (iii) a dual-reward decomposition $R_{fast} + \lambda R_{slow}$ that anchors policies to deterministic verification while regularizing toward maintainable, human-aligned style. On an internal benchmark, AutoEvolve attains the highest Success Rate (**61.4%** vs. 53.5-55.1% for MetaGPT and SWE-Agent at the same Llama-3-70B backbone) and reduces the Requirement Violation Rate to **1.6%**, roughly 7x fewer violations than task-agnostic multi-agent frameworks and roughly 10x fewer than a Monolithic ReAct baseline. The architectural contribution alone (SFT-Only AutoEvolve at 4.7% violations) drives most of the safety gain; online RL provides the remaining performance lift. Active Development Cycle Time drops from ~5 days to under 6 hours, framing safety-critical software evolution as *safe* agentic coding, not raw generation.
Submission Number: 19
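
The dual-reward decomposition $R_{fast} + \lambda R_{slow}$ named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how either term is computed, so everything below is an assumption made for illustration: the `Candidate` fields, the test-pass-ratio form of the fast reward, the hard penalty on requirement violations, the style-similarity form of the slow reward, and the default value of `lam`.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one candidate firmware change."""
    tests_passed: int            # deterministic checks that passed
    tests_total: int             # deterministic checks that were run
    requirement_violations: int  # violations flagged by a requirements check
    style_similarity: float      # assumed score in [0, 1] against human commits


def fast_reward(c: Candidate) -> float:
    """Assumed R_fast: deterministic verification signal.

    Any requirement violation yields a fixed penalty; otherwise the
    reward is the fraction of deterministic checks that passed.
    """
    if c.requirement_violations > 0:
        return -1.0
    if c.tests_total == 0:
        return 0.0
    return c.tests_passed / c.tests_total


def slow_reward(c: Candidate) -> float:
    """Assumed R_slow: delayed, human-aligned style signal."""
    return c.style_similarity


def total_reward(c: Candidate, lam: float = 0.1) -> float:
    """Combined reward R_fast + lam * R_slow."""
    return fast_reward(c) + lam * slow_reward(c)
```

With a small `lam`, the style term only nudges the ranking between candidates that already pass verification. In this sketch a candidate with any violation stays negative as long as `lam` is below 1, which is one way to read "anchors policies to deterministic verification".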