Multimodal fusion and vision-language models: A survey for robot vision

Published: 01 Jan 2026, Last Modified: 05 Nov 2025. Venue: Inf. Fusion 2026. License: CC BY-SA 4.0
Abstract: Highlights
• Comprehensive survey of multimodal fusion and VLMs for robotic vision tasks.
• Extend beyond segmentation to SLAM, manipulation, and embodied navigation.
• Highlight multimodal advantages in robustness, alignment, and reasoning ability.
• Analyze key robotics datasets on modality mix, task scope, and practical limits.
• Propose future directions on training efficiency and cross-modal alignment.