Abstract: Understanding passenger intents from spoken interactions and the car's vision (both inside and outside the vehicle) is an important building block towards developing contextual dialog systems for autonomous vehicles (AV). In this study, we continue exploring AMIE (Automated-vehicle Multimodal In-cabin Experience), the in-cabin agent responsible for handling multimodal passenger-vehicle interactions. When the passengers give instructions to AMIE, the agent should parse such commands properly by considering three modalities (i.e., verbal/language/text, vocal/audio, visual/video) and trigger the appropriate functionality of the AV system. We previously collected a multimodal in-cabin dataset with multi-turn dialogues between the passengers and AMIE using a Wizard-of-Oz (WoZ) scheme, and experimented with various RNN-based models to detect utterance-level intents (i.e., set-destination, change-route, go-faster, go-slower, stop, park, pull-over, drop-off, open-door, other) along with the relevant slots associated with these intents. In this work, we discuss the benefits of multimodal understanding of in-cabin utterances by incorporating verbal/language input together with non-verbal/acoustic and visual input from inside and outside the vehicle. This ongoing research has the potential to explore real-world challenges in human-vehicle-scene interactions for supporting autonomous driving via spoken utterances.
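To make the described setup concrete, the following is a minimal sketch of the kind of joint intent-and-slot model the abstract refers to: an RNN (here a bidirectional LSTM in PyTorch) over early-fused verbal, acoustic, and visual features, with an utterance-level intent head and a token-level slot head. The feature dimensions, slot tag inventory, fusion strategy, and class/module names are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

# Intent set size matches the 10 AMIE intents (set-destination, change-route,
# go-faster, go-slower, stop, park, pull-over, drop-off, open-door, other).
# The slot tag count, feature dimensions, and early-fusion design below are
# illustrative assumptions only.
NUM_INTENTS = 10
NUM_SLOT_TAGS = 9  # hypothetical BIO tag inventory for the relevant slots

class MultimodalIntentSlotModel(nn.Module):
    """Joint utterance-level intent detection and slot filling over fused
    verbal (token), acoustic, and visual features (early-fusion sketch)."""

    def __init__(self, vocab_size=5000, text_dim=128,
                 audio_dim=40, visual_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, text_dim, padding_idx=0)
        fused_dim = text_dim + audio_dim + visual_dim
        self.encoder = nn.LSTM(fused_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.intent_head = nn.Linear(2 * hidden_dim, NUM_INTENTS)
        self.slot_head = nn.Linear(2 * hidden_dim, NUM_SLOT_TAGS)

    def forward(self, tokens, audio_feats, visual_feats):
        # tokens: (batch, seq); acoustic and visual features are assumed to be
        # aligned per token before fusion.
        x = torch.cat([self.embed(tokens), audio_feats, visual_feats], dim=-1)
        states, _ = self.encoder(x)                            # (batch, seq, 2*hidden)
        intent_logits = self.intent_head(states.mean(dim=1))   # utterance-level intent
        slot_logits = self.slot_head(states)                   # token-level slot tags
        return intent_logits, slot_logits

# Minimal usage with random inputs, only to show the expected tensor shapes.
model = MultimodalIntentSlotModel()
tokens = torch.randint(1, 5000, (2, 12))
audio = torch.randn(2, 12, 40)
visual = torch.randn(2, 12, 64)
intent_logits, slot_logits = model(tokens, audio, visual)
print(intent_logits.shape, slot_logits.shape)  # (2, 10) and (2, 12, 9)
```

In practice the acoustic and visual streams would come from feature extractors over the in-cabin audio and the interior/exterior camera views; the sketch simply stands in for those with per-token feature vectors to show how the three modalities can be fused before the recurrent encoder.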