RSAM-UNETR: Transformers for Road Segmentation for Self-Driving Cars Using a Residual Spatial Attention Module
Abstract: Analysis and segmentation of the immediate environment are particularly useful in autonomous vehicles. Accurately marking important points and regions along the route allows a vehicle to analyze its surroundings and navigate safely. Transformer models and their derivatives, such as UNETR and the Vision Transformer with Multi-Head Attention, are widely used for segmentation. These approaches yield better image classification and segmentation results, as each attention head simultaneously analyzes a different aspect of the relationships between image fragments (patches). Given the high potential of transformer-based algorithms, in this article we present an extended UNETR model whose design additionally includes an RSAM (Residual Spatial Attention Module) attention block for road segmentation. Our RSAM-UNETR architecture, supported by additional weights in the loss function, performed very well, achieving Accuracy = 99.04%, mDice = 89.21%, and mIoU = 81.80% with a very stable training process, improving mDice by 5.46% and mIoU by 7.31% over the classic UNETR.
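To make the idea of a residual spatial attention block concrete, the following is a minimal NumPy sketch. It is an assumption-laden illustration, not the paper's actual RSAM: it follows the common CBAM-style recipe (channel-wise average and max pooling, a learned mixing of the two pooled maps in place of the usual convolution, a sigmoid gate, and a residual connection). The function name `rsam_block` and the 1x1 weighted mix are hypothetical simplifications.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def rsam_block(x, w=None, b=0.0):
    """Hypothetical residual spatial attention sketch (CBAM-style).

    x: feature map of shape (C, H, W).
    w: weights mixing the avg- and max-pooled maps, shape (2,);
       a stand-in for the conv layer a real module would learn.
    Returns (x + x * A, A), where A is a (1, H, W) attention map in (0, 1).
    """
    if w is None:
        w = np.array([0.5, 0.5])
    # Channel-wise pooling collapses C channels to one spatial map each.
    avg_map = x.mean(axis=0, keepdims=True)   # (1, H, W)
    max_map = x.max(axis=0, keepdims=True)    # (1, H, W)
    pooled = np.concatenate([avg_map, max_map], axis=0)  # (2, H, W)
    # 1x1 "conv": weighted sum of the two pooled maps (simplification
    # of the 7x7 convolution typically used in spatial attention).
    logits = np.tensordot(w, pooled, axes=([0], [0]))[None] + b  # (1, H, W)
    attn = sigmoid(logits)
    # Residual connection: reweighted features are added back to the input.
    return x + x * attn, attn
```

The residual term keeps the original features flowing through the block, so the attention map only has to learn where to *amplify* road-relevant regions, which tends to stabilize training.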