Training Vision-Language Transformers from Captions

Published: 23 Oct 2023, Last Modified: 23 Oct 2023. Accepted by TMLR.
Abstract: Vision-Language Transformers can be learned without low-level human labels (e.g. class labels, bounding boxes, etc.). Existing work, whether explicitly utilizing bounding boxes (Chen et al., 2020b; Tan & Bansal, 2019; Lu et al., 2019) or patches (Kim et al., 2021), assumes that the visual backbone must first be trained on ImageNet (Russakovsky et al., 2015) class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders (He et al., 2022), that does not require this supervision. We seek to provide general advice on multimodal pretraining by examining the roles of (a) unimodal initialization, (b) unimodal architectural components, and (c) data annotation in the pretraining corpus. Our extensive and carefully controlled studies suggest that none of the above factors is absolutely important in achieving versatile vision-language representations. We conclude our analysis with suggestions on the choices of initialization, architectural components, and annotation formats, targeting a better balance between data efficiency and representation quality.
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url:
Changes Since Last Submission: We have made the following changes suggested by the reviewers:
- (2YtN) Edit the abstract to make our contributions clear
- (2YtN) Clarify our motivation for controlling the dataset size and initialization
- (PotT, SRjs) Add comparisons with CoCa and SimVLM on model and training data scales
- (SRjs) Add conceptual comparisons with PaLI, CLIPPO and Flamingo in related works
- (PotT) Add conceptual comparisons with contrastive models such as CLIP and CoCa in related works
- (PotT, 2YtN, SRjs) Add comparisons with recent models on GLUE evaluations
- Add missing references such as VirTex, UL2, CLIPPO
- Address Reviewer 2YtN's concern that "the linear probing [in the added results on ImageNet] show [CoCa/CLIP/etc.] to be better" and determine whether this is a case of overfitting or better hyperparameter tuning in the finetuning case.
Assigned Action Editor: ~Vincent_Dumoulin1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1277