Abstract: For semantic segmentation, integrating multimodal data can vastly improve segmentation performance at the cost of increased model complexity. We introduce FuseForm, a multimodal transformer for semantic segmentation, which can effectively and efficiently fuse a large number of homogeneous modalities. We demonstrate its superior performance on 5 different multimodal datasets ranging from 2 to 12 modalities and comprehensively analyze its components. FuseForm outperforms existing methods through two novel features: a hybrid multimodal fusion block and a transformer-based decoder. It leverages a multimodal cross-attention module for global token fusion, alongside convolutional filters' ability to fuse local features. Together, the global and local fusion modules enable enhanced multimodal semantic segmentation. We also introduce a decoder based on a mirrored version of the encoder transformer, which outperforms a popular decoder when sufficiently tuned on the dataset.
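The abstract describes a hybrid fusion block that pairs cross-attention (global token fusion) with convolution (local feature fusion). Below is a minimal PyTorch sketch of how such a block might be structured for two modalities; the class name, dimensions, and specific layer choices are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class HybridFusionBlock(nn.Module):
    """Hypothetical sketch of a hybrid multimodal fusion block:
    cross-attention fuses tokens globally across modalities, while a
    convolution fuses features locally. All names and hyperparameters
    are assumptions for illustration."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Global fusion: one modality's tokens attend to the other's.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Local fusion: convolution over concatenated modality features.
        self.local_fuse = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (B, C, H, W) feature maps from two modalities.
        B, C, H, W = x_a.shape
        tok_a = x_a.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tok_b = x_b.flatten(2).transpose(1, 2)  # (B, H*W, C)
        # Global token fusion via multimodal cross-attention.
        fused, _ = self.cross_attn(
            self.norm(tok_a), self.norm(tok_b), self.norm(tok_b)
        )
        global_feat = fused.transpose(1, 2).reshape(B, C, H, W)
        # Local feature fusion via convolution over stacked features.
        local_feat = self.local_fuse(torch.cat([x_a, x_b], dim=1))
        return global_feat + local_feat
```

Under this reading, the attention path captures long-range cross-modal dependencies while the convolutional path preserves fine spatial detail; summing the two combines both scales of fusion. Extending from two modalities to the paper's reported 12 would require a multi-way variant of the attention and concatenation steps.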