VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

Published: 01 Feb 2023, Last Modified: 13 Feb 2023
Submitted to ICLR 2023
Readers: Everyone
Keywords: self-supervision, vision-language pre-training, transformer, patch-word alignment
Abstract: Vision-language pre-training (VLP) has recently proven highly effective for various uni- and multi-modal downstream applications. However, most existing end-to-end VLP methods rely on high-resolution image-text-box data to perform well on fine-grained region-level tasks, such as object detection, segmentation, and referring expression comprehension. Unfortunately, such high-resolution images with accurate bounding-box annotations are expensive to collect and use for supervision at scale. In this work, we propose VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that utilizes only image-caption data yet achieves fine-grained, region-level image understanding, eliminating the need for expensive box annotations. VoLTA adopts weakly-supervised alignment between local image patches and text tokens based on graph optimal transport (GOT), yielding an explicit, self-normalized, and interpretable low-level matching criterion. In addition, VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training and removes fusion-specific transformer layers, further reducing memory requirements. Extensive experiments on a wide range of vision and vision-language downstream tasks demonstrate the effectiveness of VoLTA on fine-grained applications without compromising coarse-grained downstream performance, often outperforming methods that use significantly more caption and box annotations.
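For intuition, below is a minimal sketch of entropic optimal-transport (Sinkhorn) matching between patch and token embeddings, the family of objectives the paper's GOT criterion builds on. All names (`sinkhorn_alignment_loss`, `patch_feats`, `token_feats`) are illustrative, not taken from the authors' code, and the full GOT objective additionally couples this node-level Wasserstein matching with a Gromov-Wasserstein term over graph structure, which is omitted here.

```python
# Hedged sketch: entropic OT (Sinkhorn) alignment of image patches and text tokens.
# Illustrates the matching criterion only; hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def sinkhorn_alignment_loss(patch_feats, token_feats, eps=0.1, n_iters=50):
    """patch_feats: (P, d) patch embeddings; token_feats: (T, d) token embeddings.
    Returns a scalar matching cost (lower = better aligned) and the transport plan."""
    # Cosine-similarity cost matrix: dissimilar patch-token pairs cost more to match.
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(token_feats, dim=-1)
    cost = 1.0 - p @ t.T                               # (P, T)

    # Uniform marginals: every patch and every token carries equal mass.
    P_, T_ = cost.shape
    mu = torch.full((P_,), 1.0 / P_)
    nu = torch.full((T_,), 1.0 / T_)

    # Sinkhorn iterations in log space for numerical stability.
    K = -cost / eps                                    # log of the Gibbs kernel
    u = torch.zeros(P_)
    v = torch.zeros(T_)
    for _ in range(n_iters):
        u = torch.log(mu) - torch.logsumexp(K + v[None, :], dim=1)
        v = torch.log(nu) - torch.logsumexp(K + u[:, None], dim=0)

    plan = torch.exp(K + u[:, None] + v[None, :])      # (P, T) soft patch-token matching
    return (plan * cost).sum(), plan

# Toy usage: 16 patches and 8 tokens in a shared 64-d embedding space.
patches, tokens = torch.randn(16, 64), torch.randn(8, 64)
loss, plan = sinkhorn_alignment_loss(patches, tokens)
```

Because the transport plan is a normalized coupling, it doubles as an interpretable patch-word correspondence map, which is the "explicit, self-normalized, and interpretable" property the abstract refers to.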
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning
TL;DR: We introduce VoLTA, Vision-Language Transformer with weakly-supervised local-feature Alignment, a VLP paradigm trained with graph optimal transport (GOT)-based image-text matching.
Supplementary Material: zip