A Method for Visual Spatial Description Based on Large Language Model Fine-tuning

Published: 2024, Last Modified: 13 Nov 2024, ACM Multimedia 2024, CC BY-SA 4.0
Abstract: In recent years, image-to-text generation has received considerable attention. One of its subtasks, Visual Spatial Description (VSD), tests a model's ability to understand spatial relationships: given an image, the model must generate a sentence describing the spatial relationship between two objects in it. In this work, a VSD method based on large language model fine-tuning (LFVSD) is proposed to improve the accuracy and robustness of visual spatial relationship descriptions. First, image and text features are extracted with pre-trained models and fused by a Q-Former. The original and fused features are then fed into FlanT5XXL. Object overlap priors are introduced, and momentum distillation is used to filter hard negative samples and generate soft labels. Finally, multiple VSD models are trained with data augmentation and long-tail data balancing. The approach, which combines multimodal feature fusion with large language model fine-tuning, is evaluated on the VSD2024 test set of 5,855 images and their corresponding textual descriptions, and the results demonstrate its effectiveness.
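The pipeline described in the abstract (frozen pre-trained encoders, Q-Former feature fusion, FlanT5XXL as the language model) mirrors the public BLIP-2 architecture. The sketch below shows how such a backbone could be prompted for a spatial description with Hugging Face Transformers; the checkpoint name, prompt, and image path are illustrative assumptions, not the authors' released code or their fine-tuned weights.

```python
# Minimal sketch of querying a BLIP-2-style backbone (image encoder + Q-Former
# + Flan-T5-XXL) for a spatial description. Checkpoint, prompt, and image path
# are assumptions for illustration only.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16
).to(device)

image = Image.open("example.jpg").convert("RGB")
# Prompt the language model to verbalize the relation between two objects.
prompt = "Describe the spatial relationship between the person and the bicycle."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```

The momentum-distillation step can likewise be sketched in the ALBEF style: keep an exponential-moving-average (EMA) copy of the model as a teacher and mix its soft predictions into the training loss so that plausible alternative phrasings are not penalized as hard negatives. The momentum m and mixing weight alpha below are assumed values, not hyper-parameters reported by the paper.

```python
# Hedged sketch of momentum distillation: EMA teacher plus soft-label loss.
import copy
import torch
import torch.nn.functional as F

def make_momentum_teacher(student: torch.nn.Module) -> torch.nn.Module:
    # The teacher is a frozen copy of the student, updated only via EMA.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, m: float = 0.995):
    # Exponential moving average of the student weights.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.data.mul_(m).add_(ps.data, alpha=1.0 - m)

def distillation_loss(student_logits, teacher_logits, labels, alpha: float = 0.4):
    # Hard cross-entropy against the ground-truth tokens ...
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # ... softened by KL divergence towards the momentum teacher's soft labels.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return (1.0 - alpha) * ce + alpha * kl
```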