FlexCap: Generating Rich, Localized, and Flexible Captions in Images

Debidatta Dwibedi; Vidhi Jain; Jonathan Tompson; Andrew Zisserman; Yusuf Aytar

18 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: representation learning for computer vision, audio, language, and other modalities

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: visual-language model, object detection, image captioning, visual question answering

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: FlexCap: A model that describes regions in images in a controllable manner.

Abstract: We introduce FlexCap, a module that generates localized descriptions for any region in a given image. We use the idea of length conditioning to ensure the output captions have the desired length. This allows for controllable generation of the full spectrum of localized captions, ranging from short object names to full sentence descriptions. To train this model, we create a dataset of image-box-caption triplets from web-scale text-image pairs using open-vocabulary object detection models. We show that FlexCap can connect images with LLMs by representing images as a sequence of region descriptions and their spatial extents. Using this interpretable textual representation, we exceed the state-of-the-art zero-shot performance on many visual question answering tasks. We also show that FlexCap can be fine-tuned to achieve strong performance on the dense captioning task on the Visual Genome dataset. Finally, we demonstrate qualitatively how FlexCap can be used for image labeling, object attribute recognition, and visual dialog.
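The two ideas in the abstract, length conditioning and a textual representation of an image as region descriptions with their spatial extents, can be illustrated with a minimal sketch. This is not the authors' implementation: the `RegionCaption` type, the `<LENGTH-n>` prefix format, the box convention, and the prompt layout are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RegionCaption:
    """One localized description: a box plus the caption generated for it."""
    # Assumed convention (not from the paper): pixel coordinates
    # in the order (x_min, y_min, x_max, y_max).
    box: Tuple[int, int, int, int]
    caption: str


def length_prefix(num_words: int) -> str:
    """Build a hypothetical length-conditioning token for the caption decoder.

    The idea is that the decoder sees the desired caption length first,
    so the same region can yield a short name or a full sentence.
    """
    if num_words < 1:
        raise ValueError("num_words must be positive")
    return f"<LENGTH-{num_words}>"


def regions_to_prompt(regions: List[RegionCaption], question: str) -> str:
    """Serialize region captions and their boxes into plain text for an LLM."""
    lines = ["Image regions (x_min, y_min, x_max, y_max): description"]
    for region in regions:
        x0, y0, x1, y1 = region.box
        lines.append(f"({x0}, {y0}, {x1}, {y1}): {region.caption}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    regions = [
        RegionCaption((10, 20, 110, 220), "a brown dog"),
        RegionCaption((150, 40, 300, 200), "a red frisbee in the air"),
    ]
    print(length_prefix(3))
    print(regions_to_prompt(regions, "What is the dog chasing?"))
```

Because the representation is plain text, it stays human-readable, which is what makes it inspectable before it is passed to a language model.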

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: zip

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 1451
