Abstract: Classifying intentions of other traffic agents is an essential task for intelligent transportation systems. To simplify this task, vehicles are equipped with various illumination systems, including turn indicators, emergency lights, rear lights, and brake lights. We extend the Waymo open perception dataset with ground truth annotations for different visual intentions to develop methods designed to classify the state of such systems. Furthermore, we propose the VISUAL INTENTION FORMER, a two-step transformer-based architecture to classify visual intentions in image sequences of tracked traffic participants. We use a vision transformer to extract image features, which are passed into a transformer encoder that reasons about temporal dependencies among them. We evaluate against different baseline architectures where our proposed method achieves state-of-the-art results. Additionally, we conduct an in-depth performance analysis of our method regarding different input sequence lengths, vehicle headings, and daytime conditions.
Loading