Abstract: This paper addresses the challenge of handling unseen modalities and dynamic modality combinations at test time with a proposed text-centric alignment method. This training-free approach unifies different input modalities into a single semantic text representation by leveraging in-context learning with Large Language Models (LLMs) and uni-modal foundation models. Our method significantly improves the handling of unseen, diverse, and unpredictable modality combinations, and can be adopted on top of both generative and discriminative models. Our extensive experiments, which primarily evaluate discriminative tasks, demonstrate that our approach is essential for LLMs to achieve strong modality alignment, and that it overcomes the limitations of traditional frameworks whose embedding representations assume a fixed set of modalities. This study contributes to the field by offering a flexible and effective solution for real-world applications where modality availability is dynamic and uncertain.
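To make the pipeline described in the abstract concrete, below is a minimal sketch of the text-centric alignment idea: each available modality is mapped to text by a uni-modal foundation model, the resulting descriptions are concatenated into one semantic text representation, and an LLM consumes it with in-context examples. All names here (MODALITY_TO_TEXT, align_to_text, llm_complete, classify) are hypothetical placeholders for illustration, not the paper's released code; the model calls are stubbed with canned strings so the sketch runs.

```python
# Hypothetical sketch of text-centric modality alignment; all model calls
# below are placeholders, not the paper's actual implementation.
from typing import Callable, Dict

# Uni-modal foundation models, each mapping one modality to text
# (e.g. an image captioner, an audio captioner, a table linearizer).
# Stubbed with canned outputs so the example is self-contained.
MODALITY_TO_TEXT: Dict[str, Callable[[object], str]] = {
    "image": lambda x: "a photo of a dog playing in a park",
    "audio": lambda x: "sound of children laughing nearby",
    "table": lambda x: "row: species=dog, activity=play",
}

def align_to_text(inputs: Dict[str, object]) -> str:
    """Unify whatever modalities arrive at test time into one semantic
    text representation; unseen combinations need no retraining."""
    parts = [f"[{name}] {to_text(value)}"
             for name, value in inputs.items()
             if (to_text := MODALITY_TO_TEXT.get(name)) is not None]
    return "\n".join(parts)

def llm_complete(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned label here."""
    return "label: outdoor_activity"

def classify(inputs: Dict[str, object], icl_examples: str) -> str:
    """In-context learning: prepend worked examples, then the aligned text."""
    prompt = f"{icl_examples}\n\nInput:\n{align_to_text(inputs)}\nLabel:"
    return llm_complete(prompt)

# Any subset of modalities can appear at test time.
example = "Input:\n[image] a cat on a sofa\nLabel: indoor_scene"
print(classify({"image": object(), "audio": object()}, icl_examples=example))
```

Because alignment happens in text space rather than in a jointly trained embedding space, swapping in a new modality only requires adding another modality-to-text converter, which is the flexibility the abstract claims over fixed-modality frameworks.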
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 775