Abstract: For semantic segmentation, integrating multimodal data can vastly improve segmentation performance at the cost of increased model complexity. We introduce FuseForm, a multimodal transformer for semantic segmentation, which can effectively and efficiently fuse a large number of homogeneous modalities. We demonstrate its superior performance on 5 different multimodal datasets ranging from 2 to 12 modalities and comprehensively analyze its components. FuseForm outperforms existing methods through two novel features: a hybrid multimodal fusion block and a transformer-based decoder. It leverages a multimodal cross-attention module for global token fusion, alongside convolutional filters' ability to fuse local features. Together, the global and local fusion modules enable enhanced multimodal semantic segmentation. We also introduce a decoder based on a mirrored version of the encoder transformer, which outperforms a popular decoder when sufficiently tuned on the dataset.
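The abstract describes a hybrid fusion block that pairs cross-attention (global token fusion) with convolution (local feature fusion). Below is a minimal PyTorch sketch of how such a block might be structured for two modalities; the class name, dimensions, and specific layer choices are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class HybridFusionBlock(nn.Module):
    """Hypothetical sketch of a hybrid multimodal fusion block:
    cross-attention fuses tokens globally across modalities, while a
    convolution fuses features locally. All names and hyperparameters
    are assumptions for illustration."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Global fusion: one modality's tokens attend to the other's.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Local fusion: convolution over concatenated modality features.
        self.local_fuse = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (B, C, H, W) feature maps from two modalities.
        B, C, H, W = x_a.shape
        tok_a = x_a.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tok_b = x_b.flatten(2).transpose(1, 2)  # (B, H*W, C)
        # Global token fusion via multimodal cross-attention.
        fused, _ = self.cross_attn(
            self.norm(tok_a), self.norm(tok_b), self.norm(tok_b)
        )
        global_feat = fused.transpose(1, 2).reshape(B, C, H, W)
        # Local feature fusion via convolution over stacked features.
        local_feat = self.local_fuse(torch.cat([x_a, x_b], dim=1))
        return global_feat + local_feat
```

Under this reading, the attention path captures long-range cross-modal dependencies while the convolutional path preserves fine spatial detail; summing the two combines both scales of fusion. Extending from two modalities to the paper's reported 12 would require a multi-way variant of the attention and concatenation steps.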