Abstract: Highlights•We propose a hybrid model based on a CNN and Transformers.•The CNN learns the visual spatial features from various sensor data.•The learned visual features are then fed into the transformer.•The results reveal the superiority of the STVL-HM compared to the baseline solutions.
Loading