Application of Vision-Language Models to Pedestrian Behavior and Scene Understanding in Autonomous Driving
Abstract: Autonomous driving (AD) has seen significant improvements in recent years and has achieved promising results in 3D detection, classification, and localization. However, many challenges remain, such as the semantic understanding of pedestrians' behaviors and the downstream handling of pedestrian interactions. Recent studies applying Large Language Models (LLMs) and Vision-Language Models (VLMs) have achieved promising results in scene understanding and high-level maneuver planning across diverse traffic scenarios. However, deploying billion-parameter LLMs on vehicles requires significant computation and memory resources. In this paper, we analyze effective knowledge distillation of LLM-generated semantic labels into smaller vision networks, which can then provide semantic representations of complex scenes for downstream decision-making in planning and control.
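To make the distillation idea concrete, the sketch below shows one way a small vision network could be trained on semantic labels produced offline by a VLM teacher. This is a minimal illustration, not the paper's actual implementation: the backbone choice (ResNet-18), the number of behavior categories, the hard-label cross-entropy objective, and all hyperparameters are assumptions for demonstration only.

```python
# Minimal PyTorch sketch of distilling VLM-generated semantic labels into a
# smaller vision network. Module names, label categories, and hyperparameters
# are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torchvision.models as models

NUM_SEMANTIC_LABELS = 8  # assumed number of pedestrian-behavior categories labeled by the VLM

# Student: a small vision backbone with a classification head.
student = models.resnet18(num_classes=NUM_SEMANTIC_LABELS)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()  # hard-label distillation from VLM annotations

def train_step(images: torch.Tensor, vlm_labels: torch.Tensor) -> float:
    """One optimization step: fit the student to labels generated by the VLM teacher."""
    optimizer.zero_grad()
    logits = student(images)             # (B, NUM_SEMANTIC_LABELS)
    loss = criterion(logits, vlm_labels) # match student predictions to VLM labels
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy tensors standing in for camera frames and offline VLM annotations.
images = torch.randn(4, 3, 224, 224)
vlm_labels = torch.randint(0, NUM_SEMANTIC_LABELS, (4,))
print(train_step(images, vlm_labels))
```

A soft-label variant (matching the teacher's output distribution with a KL-divergence loss) is another common distillation setup; which objective the paper uses is not specified in the abstract.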