Keywords: Vehicle Teleoperation, LLM, Autonomous Driving
Abstract: Large-scale driverless fleets rely on teleoperation to resolve rare, safety-critical edge cases that onboard autonomy cannot handle robustly. We introduce FleetAgent, a cloud-hosted multimodal large language model (MLLM) that assesses an autonomous vehicle's (AV's) plan and context to decide whether teleoperation is needed. FleetAgent consumes a compact vectorized representation of observations and planned actions rather than raw sensor data, and produces a natural-language explanation and evaluation of the traffic scenario and the driving decision. A dedicated vector encoder replaces conventional text tokenizers and vision encoders, substantially reducing the number of input tokens and the server memory footprint while preserving the information the task requires. We also build a dataset based on nuScenes, augmented with synthetic imperfect driving decisions and annotated explanation and evaluation labels. System-level studies indicate up to a $625 \times$ reduction in communication demand and up to a $16.54 \times$ reduction in cache size. Model-level experiments show competitive response quality and plan-evaluation accuracy, with a 41\% improvement in BLEU score and an 11\% reduction in task failure rate. Because all computation runs in the cloud, the approach adds no onboard burden. Together, these results outline a practical path to scalable, explainable teleoperation support for AV fleets and point toward a new paradigm for applying MLLMs to autonomous driving.
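As a minimal illustration of the vector-encoder idea described in the abstract (a sketch, not the authors' implementation), the snippet below projects per-entity scene vectors, e.g., agent states and planned waypoints, directly into the LLM's embedding space, so each entity costs a single input token rather than many text or image tokens. All module names, feature dimensions, and the MLP design are illustrative assumptions.

```python
# Hypothetical sketch of a vector encoder replacing a text tokenizer /
# vision encoder: each scene entity becomes one LLM "token". Names,
# dimensions, and architecture are assumptions, not FleetAgent's code.
import torch
import torch.nn as nn

class VectorEncoder(nn.Module):
    """Projects per-entity feature vectors to the LLM hidden size."""
    def __init__(self, feat_dim: int = 32, hidden: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, scene: torch.Tensor) -> torch.Tensor:
        # scene: (batch, num_entities, feat_dim), e.g. agent poses,
        # velocities, map polylines, and the AV's planned waypoints.
        return self.mlp(scene)  # (batch, num_entities, llm_dim)

# One embedding per entity: a 40-entity scene costs 40 input tokens,
# which is the kind of compression behind the communication and
# cache savings reported above.
encoder = VectorEncoder()
scene = torch.randn(1, 40, 32)
tokens = encoder(scene)  # ready to prepend to the MLLM input sequence
print(tokens.shape)  # torch.Size([1, 40, 4096])
```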
Primary Area: applications to robotics, autonomy, planning
Submission Number: 13704