Keywords: Distributed Learning, model collaboration, large model
Abstract: Unlike existing single-modality large-small model collaborations, multi-modality large-small model collaboration is an under-explored paradigm in which a cloud-side multi-modality large model (MM-LM) collaborates with parties' small models (SMs) to achieve bidirectional domain-specific performance improvements. Nevertheless, this paradigm faces two key challenges. First, the MM-LM inherently relies on abundant modality-aligned samples for training, but geographical and device diversity across parties inevitably leads to differences in collected samples and modalities. These differences significantly reduce the overlapping sample entities across parties' multi-modality datasets, creating a modality alignment scarcity challenge. Second, collection device failures and human annotation costs further lead to different modality-missing problems in each party's dataset. Existing modality completion methods typically require sufficient modality-complete training samples to ensure generation quality, creating a modality completeness gap challenge. To address these challenges, we propose a multi-modality large-small model bidirectional collaboration framework, named BoMM, which consists of two key components. First, a global prototype-guided alignment strategy identifies potentially aligned samples through similarity distribution comparisons between unaligned data and established global prototypes, enabling knowledge transfer from SMs to the MM-LM. Second, building on the established prototypes, a preference-driven modality-adaptive completion method integrates direct preference optimization into generator training with real-time scheduling to dynamically complete missing modalities, enabling knowledge transfer from the MM-LM to SMs. Theoretical analysis confirms BoMM's O(1/\sqrt{T}) convergence rate. Across three multi-modality scenarios, BoMM outperforms state-of-the-art methods by up to 6.64% on two well-known datasets. Our code is available at https://anonymous.4open.science/r/MultiLM-5D65.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 7219
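Below is a minimal sketch of the kind of prototype-guided matching the abstract's first component describes: comparing each unaligned sample's similarity distribution over global prototypes and pairing cross-modality samples whose distributions agree. The function names, the Jensen-Shannon agreement measure, the temperature, and the threshold are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def similarity_distribution(features, prototypes, temperature=0.1):
    """Softmax-normalized cosine similarities of each sample to the global prototypes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ p.T / temperature                    # (n_samples, n_prototypes)
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def find_potentially_aligned(feat_mod_a, feat_mod_b, prototypes, threshold=0.5):
    """Pair unaligned samples from two modalities whose prototype-similarity
    distributions agree (low Jensen-Shannon divergence -> high agreement).
    The JSD-based agreement score and threshold are assumptions for illustration."""
    dist_a = similarity_distribution(feat_mod_a, prototypes)
    dist_b = similarity_distribution(feat_mod_b, prototypes)
    kl = lambda p, q: np.sum(p * np.log((p + 1e-12) / (q + 1e-12)), axis=-1)
    pairs = []
    for i, da in enumerate(dist_a):
        m = 0.5 * (da[None, :] + dist_b)              # midpoint distributions
        jsd = 0.5 * kl(da[None, :], m) + 0.5 * kl(dist_b, m)
        j = int(np.argmin(jsd))
        agreement = 1.0 - jsd[j] / np.log(2)          # normalize JSD (natural log) to [0, 1]
        if agreement >= threshold:
            pairs.append((i, j, float(agreement)))
    return pairs

# Toy usage: 8 text features and 10 image features in a shared 16-d space,
# compared against 4 hypothetical global prototypes.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(4, 16))
feat_text = rng.normal(size=(8, 16))
feat_image = rng.normal(size=(10, 16))
print(find_potentially_aligned(feat_text, feat_image, prototypes))
```

In this sketch, a high-agreement pair is treated as "potentially aligned" and could then be used as a pseudo-aligned sample for training; how BoMM actually selects, weights, or filters such pairs is specified in the paper, not here.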