Keywords: Large Language Models, Iterative Framework, Image Captioning, Caption Enhancement
TL;DR: An iterative framework that uses large language models for image caption enhancement.
Abstract: Comprehensive image captioning is a critical task with applications spanning many domains, such as assistive technologies, automated content development, e-commerce, and surveillance and security.
Research on image captioning has a long history and numerous successes; however, obtaining high-quality labeled images for model training remains a challenge. While recent large visual language models such as GPT-4 are capable of both generating detailed captions for images and producing labeled images for training smaller models, they have several drawbacks. First, such models are expensive, either computationally or financially. Second, they require extensive prompt engineering to achieve the desired outputs. Third, without a ground truth it is difficult to quantitatively evaluate the quality of the captions they generate. Addressing this challenge, we present an automated framework that allows multiple small models to collaborate on comprehensive image captioning without the need for labeled images. In brief, the system operates by having a captioner generate and continuously improve descriptions of input images so that a generator can synthesize images increasingly similar to the originals; the similarity between images is computed by an evaluator. Through experimental study, we show that our framework provides considerable improvements in the comprehensiveness of captions over a standalone visual language model, bridging the gap between small models and larger ones such as GPT-4o.
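A minimal sketch of the captioner-generator-evaluator loop described in the abstract is shown below. All component interfaces here (`describe`, `synthesize`, `similarity`, `revise`) and the parameters (`max_iters`, `threshold`) are hypothetical placeholders for illustration, not the framework's actual API.

```python
def refine_caption(image, captioner, generator, evaluator,
                   max_iters=5, threshold=0.9):
    """Iteratively improve a caption until an image regenerated from it
    is sufficiently similar to the original (sketch, hypothetical API)."""
    caption = captioner.describe(image)           # initial caption
    best_caption, best_score = caption, 0.0
    for _ in range(max_iters):
        synthetic = generator.synthesize(caption)        # text-to-image step
        score = evaluator.similarity(image, synthetic)   # e.g., an image-similarity metric
        if score > best_score:
            best_caption, best_score = caption, score
        if score >= threshold:                           # caption is comprehensive enough
            break
        # Ask the captioner to revise, conditioned on the current mismatch
        caption = captioner.revise(image, caption, synthetic)
    return best_caption, best_score
```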
Submission Number: 39