A Practitioner's Guide to Real-World Continual Multimodal Pretraining

Published: 26 Sept 2024, Last Modified: 13 Nov 2024
NeurIPS 2024 Track Datasets and Benchmarks Poster
License: CC BY 4.0
Keywords: continual pretraining, multimodal learning, vision-language, CLIP
TL;DR: We provide practical insights into how to continually pretrain contrastive multimodal models under compute and data constraints.
Abstract: Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications demand adaptation to specific subdomains, tasks or concepts --- spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining through a research test bed and offer comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) data mixtures and stream orderings that emulate real-world deployment settings, (2) methods ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta-learning-rate schedules and mechanistic design choices, and (4) model and compute scaling. Together, our insights provide a practitioner's guide to continual multimodal pretraining for real-world deployment. The benchmark and code are provided at: https://github.com/ExplainableML/fomo_in_flux.
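To make the setting concrete, below is a minimal, hypothetical sketch of a continual pretraining loop in the spirit described by the abstract: a stream of update tasks, a per-task learning-rate schedule with re-warmup, a CLIP-style contrastive objective, and simple weight-space interpolation ("model merging") back towards the pre-update checkpoint to limit forgetting. The model stub, the synthetic task stream, and all hyperparameters are illustrative assumptions and not the paper's actual protocol; see the repository linked above for the real benchmark and methods.

```python
# Minimal sketch of a continual multimodal pretraining loop (illustrative only).
import copy
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyTwoTower(nn.Module):
    """Stand-in for a CLIP-style two-tower model (image / text encoders)."""

    def __init__(self, dim_img=64, dim_txt=32, dim_emb=16):
        super().__init__()
        self.image_encoder = nn.Linear(dim_img, dim_emb)
        self.text_encoder = nn.Linear(dim_txt, dim_emb)
        self.logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

    def forward(self, images, texts):
        img = F.normalize(self.image_encoder(images), dim=-1)
        txt = F.normalize(self.text_encoder(texts), dim=-1)
        return self.logit_scale.exp() * img @ txt.t()


def contrastive_loss(logits):
    """Symmetric InfoNCE loss used for CLIP-style contrastive pretraining."""
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def merge_weights(model, reference_state, alpha=0.5):
    """Interpolate current weights towards a reference checkpoint (merging)."""
    merged = {k: alpha * v + (1 - alpha) * reference_state[k]
              for k, v in model.state_dict().items()}
    model.load_state_dict(merged)


def continual_update(model, task_stream, steps_per_task=50, base_lr=1e-4):
    """Sequentially adapt the model to each update task under a step budget."""
    for task_loader in task_stream:
        reference = copy.deepcopy(model.state_dict())  # pre-update checkpoint
        opt = torch.optim.AdamW(model.parameters(), lr=base_lr)
        # Per-task re-warmup + decay schedule (one of many possible choices).
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=base_lr, total_steps=steps_per_task, pct_start=0.1)
        it = iter(task_loader)
        for _ in range(steps_per_task):
            try:
                images, texts = next(it)
            except StopIteration:
                it = iter(task_loader)
                images, texts = next(it)
            loss = contrastive_loss(model(images, texts))
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()
        merge_weights(model, reference, alpha=0.5)  # pull back towards old weights


if __name__ == "__main__":
    torch.manual_seed(0)
    # Two synthetic "update tasks" standing in for benchmark subdomains.
    stream = [
        torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(torch.randn(256, 64),
                                           torch.randn(256, 32)),
            batch_size=32, shuffle=True)
        for _ in range(2)
    ]
    continual_update(TinyTwoTower(), stream)
```

In this sketch, the interpolation coefficient trades plasticity (adapting to the new task) against stability (retaining the previous checkpoint); the benchmark itself compares such merging strategies against fine-tuning, classical continual learning methods, and parameter-efficient updates.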
Supplementary Material: pdf
Submission Number: 2171