Abstract: This paper addresses the challenge of handling unseen modalities and dynamic modality combinations at test time with a proposed text-centric alignment method. This training-free approach unifies different input modalities into a single semantic text representation by leveraging in-context learning with Large Language Models (LLMs) and uni-modal foundation models. Our method significantly improves the handling of unseen, diverse, and unpredictable modality combinations, and can be adopted on top of both generative and discriminative models. Our extensive experiments, which primarily evaluate discriminative tasks, demonstrate that our approach is essential for LLMs to achieve strong modality alignment, and that it overcomes the limitations of traditional frameworks whose embedding representations assume a fixed set of modalities. This study contributes to the field by offering a flexible and effective solution for real-world applications where modality availability is dynamic and uncertain.
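To make the pipeline described in the abstract concrete, below is a minimal sketch of the text-centric alignment idea: each available modality is mapped to text by a uni-modal foundation model, the resulting descriptions are concatenated into one semantic text representation, and an LLM consumes it with in-context examples. All names here (MODALITY_TO_TEXT, align_to_text, llm_complete, classify) are hypothetical placeholders for illustration, not the paper's released code; the model calls are stubbed with canned strings so the sketch runs.

```python
# Hypothetical sketch of text-centric modality alignment; all model calls
# below are placeholders, not the paper's actual implementation.
from typing import Callable, Dict

# Uni-modal foundation models, each mapping one modality to text
# (e.g. an image captioner, an audio captioner, a table linearizer).
# Stubbed with canned outputs so the example is self-contained.
MODALITY_TO_TEXT: Dict[str, Callable[[object], str]] = {
    "image": lambda x: "a photo of a dog playing in a park",
    "audio": lambda x: "sound of children laughing nearby",
    "table": lambda x: "row: species=dog, activity=play",
}

def align_to_text(inputs: Dict[str, object]) -> str:
    """Unify whatever modalities arrive at test time into one semantic
    text representation; unseen combinations need no retraining."""
    parts = [f"[{name}] {to_text(value)}"
             for name, value in inputs.items()
             if (to_text := MODALITY_TO_TEXT.get(name)) is not None]
    return "\n".join(parts)

def llm_complete(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned label here."""
    return "label: outdoor_activity"

def classify(inputs: Dict[str, object], icl_examples: str) -> str:
    """In-context learning: prepend worked examples, then the aligned text."""
    prompt = f"{icl_examples}\n\nInput:\n{align_to_text(inputs)}\nLabel:"
    return llm_complete(prompt)

# Any subset of modalities can appear at test time.
example = "Input:\n[image] a cat on a sofa\nLabel: indoor_scene"
print(classify({"image": object(), "audio": object()}, icl_examples=example))
```

Because alignment happens in text space rather than in a jointly trained embedding space, swapping in a new modality only requires adding another modality-to-text converter, which is the flexibility the abstract claims over fixed-modality frameworks.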
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 775