Making Text-Image Connection Formal and Practical

Published: 11 Jul 2023, Last Modified: 16 Jul 2023
Venue: NCW ICML 2023
Keywords: GPT, ViT, Transformers, Zero-Shot, Computer Vision, Natural Language Processing
Abstract: Text and image feature extraction is at the core of several state-of-the-art artificial intelligence algorithms, including DALL-E 2, Stable Diffusion, and Segment Anything. However, models that connect images and text are usually trained using hundreds of GPUs and tens or even hundreds of millions of data points, making it infeasible for most practitioners to perform the training from scratch. Furthermore, these groundbreaking works would benefit from more formally defined algorithms to enable easier adoption and implementation. To address these issues, this paper elaborates on a formal and intuitive algorithm for text-image connection and proposes an alternative way to train CLIP, a neural network model that learns joint representations from text and images, on low computing resources. In our experiments, two models were trained on 85% of WKIT, a dataset of text-image pairs, in a setting constrained to a single GPU, by using mixed precision in back-propagation and by shrinking the input image resolution and the maximum query length relative to the original CLIP. Our results show that it is not only feasible to train image-text connection models from scratch in this constrained setting, but also that reducing the input image resolution yields better zero-shot classification accuracy.
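The abstract describes a CLIP-style contrastive training setup under tight resource constraints (single GPU, mixed-precision back-propagation, reduced image resolution and query length). The following is a minimal sketch of what such a training step could look like in PyTorch; the model architecture, image size (112x112), token length (32), and all hyperparameters are illustrative assumptions and not the paper's exact configuration.

```python
# Sketch of a CLIP-style contrastive training step with mixed precision on one GPU.
# All sizes and hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCLIP(nn.Module):
    def __init__(self, embed_dim=256, vocab_size=49408):
        super().__init__()
        # Image encoder: a small patchify-and-pool stand-in operating on
        # reduced-resolution inputs (a real setup would use a small ViT).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=16, stride=16),  # patchify
            nn.Flatten(2),                                 # (B, 64, n_patches)
        )
        self.image_proj = nn.Linear(64, embed_dim)
        # Text encoder: token embedding + shallow transformer, short max length.
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1 / 0.07)

    def forward(self, images, tokens):
        img = self.image_encoder(images).mean(dim=2)            # (B, 64)
        img = F.normalize(self.image_proj(img), dim=-1)          # (B, D)
        txt = self.text_encoder(self.token_emb(tokens)).mean(dim=1)
        txt = F.normalize(txt, dim=-1)                           # (B, D)
        return img, txt, self.logit_scale.exp()

def clip_loss(img, txt, scale):
    # Symmetric InfoNCE loss over in-batch image-text pairs, as in CLIP.
    logits = scale * img @ txt.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyCLIP().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

# Dummy batch at an assumed reduced resolution (112x112) and query length (32 tokens).
images = torch.randn(8, 3, 112, 112, device=device)
tokens = torch.randint(0, 49408, (8, 32), device=device)

# Mixed-precision forward pass and scaled back-propagation.
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    img, txt, scale = model(images, tokens)
    loss = clip_loss(img, txt, scale)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```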
Submission Number: 3