Abstract: Highlights
• Comprehensive survey of multimodal fusion and VLMs for robotic vision tasks.
• Extend beyond segmentation to SLAM, manipulation, and embodied navigation.
• Highlight multimodal advantages in robustness, alignment, and reasoning ability.
• Analyze key robotics datasets on modality mix, task scope, and practical limits.
• Propose future directions on training efficiency and cross-modal alignment.
External IDs: dblp:journals/inffus/HanCFFFAWGMZXX26