GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang; Zhengyuan Yang; Xiaowei Hu; Linjie Li; Kevin Lin; Zhe Gan; Zicheng Liu; Ce Liu; Lijuan Wang

Published: 13 Dec 2022, Last Modified: 17 Sept 2024. Accepted by TMLR.

Abstract: In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost model performance. Without bells and whistles, GIT establishes new state-of-the-art results on numerous challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
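As a rough illustration of the "single language modeling task" named in the abstract, the sketch below computes an average next-token cross-entropy over caption tokens and builds an attention mask in which text tokens see all image tokens plus earlier text. The function names, the toy inputs, and the exact masking pattern are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def language_modeling_loss(step_logits, target_ids):
    """Average next-token cross-entropy over the caption tokens.

    step_logits[t] holds the decoder's vocabulary logits at step t and
    target_ids[t] is the token id that should be predicted at that step.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def prefix_causal_mask(num_image_tokens, num_text_tokens):
    """mask[i][j] is True when position i may attend to position j.

    Assumed pattern: image tokens attend to all image tokens; text tokens
    attend to all image tokens and to text tokens up to and including
    themselves. Image tokens never attend to text tokens.
    """
    n = num_image_tokens + num_text_tokens
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < num_image_tokens:
                mask[i][j] = True  # image tokens are visible to everyone
            elif i >= num_image_tokens and j <= i:
                mask[i][j] = True  # causal attention within the text
    return mask


if __name__ == "__main__":
    # Toy example: vocabulary of 4 tokens, caption of 3 tokens.
    logits = [[0.0, 0.0, 0.0, 0.0]] * 3
    print(language_modeling_loss(logits, [1, 2, 3]))
    for row in prefix_causal_mask(2, 2):
        print(row)
```

With uniform logits over a vocabulary of size V, the loss reduces to log V, which is a convenient sanity check before plugging in real model outputs.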

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission:
- Added zero-shot, few-shot, and full-shot experiments on the Flickr captioning task in Sec. 4.2/Table 3 of the main paper and Table 7 of the supplementary materials.
- Added information on the use of external datasets for the nocaps benchmark in footnote 2 to reduce confusion.
- Added zero-shot and few-shot results on the ImageNet classification task in Sec. 4.4/Table 9 of the main paper and Table 13 of the supplementary materials.
- Added details on how the video frames are sampled in Sec. D of the supplementary materials.
- Added results on TextOCR in Sec. F of the supplementary materials.
- Added an ablation study of different initialization schemes for the text decoder transformer in Sec. G.3/Table 16 of the supplementary materials.
- Added an ablation study of different initialization schemes for the image encoder in Sec. G.4/Table 17 of the supplementary materials.
- Added a bias study over gender and skin in Sec. G.6/Table 19 of the supplementary materials.

Assigned Action Editor: ~Marcus_Rohrbach1

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Submission Number: 391
