Track: tiny paper (up to 4 pages)
Keywords: Multimodal, CLIP, CLAP, Contrastive Learning, NLP, Computer Vision, Speech
TL;DR: Shared encoders in small multimodal transformers outperform modality-specific ones, achieving better retrieval with fewer parameters, though adding modalities can introduce performance trade-offs under fixed capacity.
Abstract: Shared encoders have proven effective for large-scale multimodal contrastive learning, but it is less clear whether their advantages persist in small, parameter-constrained regimes. We investigate this question through a focused empirical study, training models under strict transformer parameter budgets on a naturally aligned text, image, and speech dataset. Across a range of small model configurations, we observe that allocating transformer parameters to a single shared encoder often yields better retrieval performance than splitting the same capacity across modality-specific encoders. We further find that merging modality-specific encoders into a shared encoder can substantially reduce the transformer parameter count while preserving comparable performance on several modality pairs. Finally, in trimodal training, we observe an empirical trade-off under fixed capacity: adding a third modality improves retrieval for the weaker modality pairs while degrading it for the stronger ones. These results suggest that, in tightly constrained settings, allocating parameters to a shared encoder is an effective default for parameter-efficient multimodal learning.
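
For a concrete picture of the configuration compared above, the following is a minimal sketch (not the authors' code): a single shared transformer trunk fed by lightweight modality-specific projections and trained with a CLIP-style symmetric contrastive loss on one modality pair. All names (SharedEncoder, contrastive_loss), dimensions, and toy batches are hypothetical illustrations under assumed settings, not details taken from the paper.

    # Illustrative sketch only: shared-encoder contrastive setup (PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedEncoder(nn.Module):
        # One transformer trunk shared across modalities; only the input
        # projections ("tokenizers") are modality-specific and kept small,
        # so the fixed parameter budget is concentrated in the shared trunk.
        def __init__(self, dim=256, depth=4, heads=4, text_vocab=8000,
                     image_patch_dim=768, speech_feat_dim=80):
            super().__init__()
            self.text_embed = nn.Embedding(text_vocab, dim)
            self.image_proj = nn.Linear(image_patch_dim, dim)
            self.speech_proj = nn.Linear(speech_feat_dim, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, num_layers=depth)
            self.out_proj = nn.Linear(dim, dim)

        def _encode(self, tokens):
            h = self.trunk(tokens)                      # (batch, seq, dim)
            return F.normalize(self.out_proj(h.mean(dim=1)), dim=-1)

        def encode_text(self, ids):
            return self._encode(self.text_embed(ids))

        def encode_image(self, patches):
            return self._encode(self.image_proj(patches))

        def encode_speech(self, frames):
            return self._encode(self.speech_proj(frames))

    def contrastive_loss(z_a, z_b, temperature=0.07):
        # Symmetric InfoNCE over a batch of aligned pairs (CLIP-style).
        logits = z_a @ z_b.t() / temperature
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    model = SharedEncoder()
    text_ids = torch.randint(0, 8000, (8, 16))    # toy batch of token ids
    image_patches = torch.randn(8, 49, 768)       # toy batch of flattened patches
    loss = contrastive_loss(model.encode_text(text_ids),
                            model.encode_image(image_patches))
    loss.backward()

In this sketch, the modality-specific front-ends are single embedding/linear layers, so nearly all transformer parameters sit in the shared trunk; a modality-specific baseline would instead split the same depth and width into separate per-modality trunks.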
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 81