Published: 2022, Last Modified: 05 Nov 2023; Trans. Mach. Learn. Res. 2022; Readers: Everyone
Abstract: In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide...