Simple is Better than Complex: A Representation-centric Perspective for Prompting-based Vision--Language Fusion

17 Apr 2026 (modified: 23 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Interactive prompting is an appealing approach to vision--language fusion with frozen unimodal transformers, yet recent progress often relies on increasingly complex prompting architectures. A natural question arises: instead of refining prompt designs, can fusion be improved more effectively by directly adapting internal representations within attention layers? Our analysis, from a representation-centric perspective, suggests that interactive prompting itself has limited ability to directly alter value token representations and intra-modal token interactions. This motivates a lightweight alternative that targets these internal attention representations rather than increasing prompting complexity. Specifically, we investigate the cross-attention mechanism and propose combining value-only low-rank adaptation with a key--query replacement strategy, yielding a simple and parameter-efficient fusion design. Across common multimodal fusion benchmarks, the proposed method consistently outperforms prior prompting-based fusion baselines while requiring fewer trainable parameters. These results, along with further ablations, support representation-centric adaptation as an effective principle for prompting-based vision--language fusion.
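To make the abstract's fusion design concrete, below is a minimal PyTorch-style sketch of what value-only low-rank adaptation combined with key--query replacement in cross-attention could look like. It is not the paper's implementation: it assumes single-head attention, one plausible reading of "key--query replacement" (queries from one modality, keys and values replaced by the other modality's tokens), and illustrative names (ValueLoRACrossAttention, rank, lora_alpha) throughout.

```python
import torch
import torch.nn as nn


class ValueLoRACrossAttention(nn.Module):
    """Sketch of a frozen attention layer where only the value projection
    receives a trainable low-rank (LoRA) update, and the keys/values are
    replaced with the other modality's tokens. Illustrative only."""

    def __init__(self, dim: int, rank: int = 4, lora_alpha: float = 1.0):
        super().__init__()
        # Projections inherited from the frozen pretrained unimodal transformer.
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        for p in self.parameters():
            p.requires_grad = False
        # Trainable low-rank adapter on the value path only.
        self.lora_down = nn.Linear(dim, rank, bias=False)
        self.lora_up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # zero init: adapter starts as a no-op
        self.lora_scale = lora_alpha / rank
        self.attn_scale = dim ** -0.5

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Key--query replacement: queries come from this modality's tokens (x),
        # while keys and values come from the other modality's tokens (other).
        q = self.w_q(x)                      # (B, N_x, D)
        k = self.w_k(other)                  # (B, N_o, D)
        # Value-only LoRA: frozen value projection plus a low-rank correction.
        v = self.w_v(other) + self.lora_scale * self.lora_up(self.lora_down(other))
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.attn_scale, dim=-1)
        return attn @ v                      # (B, N_x, D)
```

Under these assumptions, only the two adapter matrices train (roughly 2 · dim · rank parameters per layer), which is consistent with the parameter-efficiency claim in the abstract.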
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jaeho_Lee3
Submission Number: 8482