Hierarchical Approaches for Domain-Specific Image Captioning: Classification, Distillation, and Optimization
Abstract: This paper presents a novel hierarchical approach to image captioning that improves the accuracy, contextual appropriateness, and linguistic diversity of generated captions. We address key limitations of existing multimodal models, including inaccurate object relationships, missed details, and poor domain-specific understanding. Our method comprises three main components: a classification-guided prompting system that leverages domain-specific knowledge, a knowledge distillation framework that transfers captioning capabilities from GPT-4o to the LLaVA model, and an iterative Direct Preference Optimization (DPO) procedure that refines caption quality. Extensive experiments demonstrate that our approach outperforms existing methods, achieving near-GPT-4o performance while remaining computationally efficient. Additionally, we release a high-quality dataset of 9,840 image-caption pairs spanning 18 categories, providing a valuable resource for future research in domain-specific image captioning.
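For context, a minimal sketch of the preference-optimization component: standard DPO (Rafailov et al., 2023) fine-tunes the policy $\pi_\theta$ against a frozen reference model $\pi_{\mathrm{ref}}$ on preferred/rejected caption pairs $(y_w, y_l)$ for an image prompt $x$. The iterative variant described in the abstract presumably reapplies this objective over successive rounds of newly collected preference data (the exact iteration scheme is not specified here):

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $\sigma$ is the logistic function and $\beta$ controls how far the fine-tuned captioner may drift from the reference model.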
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching
Contribution Types: Data resources, Surveys
Languages Studied: English
Submission Number: 7701