Abstract: Motion forecasting is a foundational task in autonomous driving, where accurately predicting the long-distance motion trajectories of traffic participants depends heavily on modeling the long-range global context of the traffic scene. The current mainstream approach combines vectorized scene representations with attention mechanisms for context encoding. However, the inherent quadratic complexity of self-attention makes fully encoding long-range context prohibitively expensive for these attention-based methods, so they generally resort to local attention as a trade-off between performance and efficiency. Inspired by the recent success of state space models (SSMs) with linear complexity in long-sequence modeling, this paper introduces the Attention-SSM Block (ASB) to capture long-range contextual features for motion forecasting. The ASB first extracts local context with simple local attention, then sorts the resulting tokens in a specific order and feeds them into a modified SSM that incorporates relative position encodings between input tokens. We build an encoder based on the ASB and combine it with a query-based decoder to form our motion forecasting model, MambaTraj. MambaTraj achieves excellent performance on the widely used Argoverse 2 benchmark with a small parameter count and low inference latency, confirming its effectiveness and efficiency in modeling long-range context for motion forecasting.
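To make the two-stage pattern described above concrete, the following is a minimal PyTorch sketch of an ASB-style block: windowed local attention followed by a linear-time SSM scan over sorted tokens. All names here (AttentionSSMBlock, SimpleSSM, sort_keys), the dimensions, the band-mask window, and the sorting criterion are illustrative assumptions, and the plain diagonal SSM is only a stand-in for the paper's modified SSM with relative position encodings.

```python
# Illustrative sketch of an Attention-SSM Block (ASB); not the paper's code.
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Diagonal linear SSM scan: h_t = a * h_{t-1} + B x_t, y_t = C h_t.

    A stand-in for the paper's modified SSM (which additionally uses
    relative position encodings between input tokens).
    """

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Negative log-decays so that exp(A) lies strictly in (0, 1).
        self.A = nn.Parameter(-torch.rand(d_state) - 0.5)
        self.B = nn.Linear(d_model, d_state, bias=False)
        self.C = nn.Linear(d_state, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        decay = torch.exp(self.A)                      # per-channel decay in (0, 1)
        u = self.B(x)                                  # (batch, seq_len, d_state)
        h = torch.zeros(x.size(0), decay.numel(), device=x.device)
        ys = []
        for t in range(x.size(1)):                     # linear-time recurrent scan
            h = decay * h + u[:, t]
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)                  # (batch, seq_len, d_model)


class AttentionSSMBlock(nn.Module):
    """Local attention for nearby context, then an SSM pass over sorted tokens."""

    def __init__(self, d_model: int = 128, n_heads: int = 4, window: int = 16):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ssm = SimpleSSM(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor, sort_keys: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, d_model)
        # sort_keys: (batch, n_tokens), e.g. a spatial ordering criterion
        # (the actual sorting rule is an assumption here).
        n = tokens.size(1)
        # Band mask restricting each token's attention to a local window.
        idx = torch.arange(n, device=tokens.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        local, _ = self.local_attn(tokens, tokens, tokens, attn_mask=mask)
        x = self.norm1(tokens + local)
        # Sort tokens into a 1-D sequence before the linear-complexity SSM scan.
        order = sort_keys.argsort(dim=1)
        gather_idx = order[..., None].expand_as(x)
        x_sorted = x.gather(1, gather_idx)
        y_sorted = self.ssm(x_sorted)
        # Scatter the SSM output back to the original token positions.
        y = torch.empty_like(y_sorted).scatter_(1, gather_idx, y_sorted)
        return self.norm2(x + y)


if __name__ == "__main__":
    block = AttentionSSMBlock()
    toks = torch.randn(2, 64, 128)       # 64 scene tokens per sample
    keys = torch.randn(2, 64)            # arbitrary sort keys for this demo
    print(block(toks, keys).shape)       # torch.Size([2, 64, 128])
```

The point of the design, as the abstract frames it, is the complexity split: the attention stage is O(n·w) for window size w, while the recurrent scan is O(n), so long-range context flows through the SSM rather than through quadratic global attention.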