How to Teach Large Multimodal Models New Skills

ICLR 2026 Conference Submission 12670 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language Models, Large Multimodal Models, Continual Learning
Abstract: How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine‑tuning on five target skills while monitoring general ability on eight held‑out benchmarks across three model families. We observe that apparent “forgetting” on held‑out tasks after narrow fine‑tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, made visible by a simple counting‑bias probe, and show that this shift co‑varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self‑attention projection layers, and (ii) updating only the MLP Gate & Up projections while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held‑out performance.
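To make the two recipes concrete, here is a minimal PyTorch/Transformers sketch (not the authors' code) of selectively unfreezing parameter groups before fine‑tuning. The module‑name patterns assume LLaMA‑style layer naming, and the checkpoint is only an illustrative placeholder; adapt both to the model family being tuned.

```python
# Minimal sketch, not the authors' implementation: freeze everything, then
# re-enable gradients only for the chosen parameter groups. The name patterns
# ("q_proj", "gate_proj", ...) assume LLaMA-style naming and are assumptions.
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")  # illustrative checkpoint

def unfreeze_only(model, patterns):
    """Freeze every parameter, then re-enable gradients for parameters whose
    qualified name contains any of the given substrings."""
    for name, param in model.named_parameters():
        param.requires_grad = any(p in name for p in patterns)

# Recipe (i): update only the self-attention projection layers.
unfreeze_only(model, ["q_proj", "k_proj", "v_proj", "o_proj"])

# Recipe (ii): update only the MLP Gate & Up projections, keeping Down frozen.
# unfreeze_only(model, ["gate_proj", "up_proj"])

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```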
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 12670