Fine-grained person-based image captioning via advanced spectrum parsing

Published: 01 Jan 2024, Last Modified: 02 Mar 2025. Multim. Tools Appl. 2024. License: CC BY-SA 4.0
Abstract: Recent image captioning models have demonstrated remarkable performance in capturing substantial global semantic information in coarse-grained images and achieving high object coverage rates in generated captions. When applied to fine-grained images that contain heterogeneous object attributes, however, these models often struggle to maintain the desired granularity due to inadequate attention to local content. This paper investigates fine-grained caption generation for person-based images and proposes the Advanced Spectrum Parsing (ASP) model. Specifically, we design a novel spectrum branch to unveil the latent contour features of detected objects in the spectrum domain. We also preserve the spatial feature branch employed in existing methods and leverage a multi-level feature extraction module to extract both spatial and spectrum features. Furthermore, we optimize these features to learn the spatial-spectrum correlation and complete the feature concatenation procedure via a multi-scale feature fusion module. In the inference stage, the integrated features enable the model to focus more on the local semantic regions of the person in the image. Extensive experimental results demonstrate that the proposed ASP yields promising results on person-based datasets with both comprehensiveness and fine granularity.
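The abstract does not specify how the spectrum branch or the multi-scale fusion module is implemented. The following is a minimal, hypothetical sketch of one plausible reading: a spectrum branch that maps backbone features into the frequency domain with a 2D FFT to emphasize contour-like information, and a fusion module that concatenates spatial and spectrum features and learns their correlation with a small convolutional block. All module names, shapes, and the choice of FFT are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the ASP architecture is not detailed in this
# abstract. Module names, feature shapes, and the 2D-FFT spectrum branch
# are assumptions made for demonstration purposes.
import torch
import torch.nn as nn


class SpectrumBranch(nn.Module):
    """Hypothetical spectrum branch: transforms spatial features into the
    frequency domain and projects the magnitude spectrum back to the
    original channel dimension."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a detector / backbone
        spec = torch.fft.fft2(x, norm="ortho")  # complex 2D spectrum
        mag = torch.abs(spec)                   # magnitude emphasizes contours
        return self.proj(mag)


class MultiScaleFusion(nn.Module):
    """Hypothetical fusion module: concatenates spatial and spectrum
    features and learns their correlation with a lightweight conv block."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, spatial: torch.Tensor, spectrum: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([spatial, spectrum], dim=1))


if __name__ == "__main__":
    feats = torch.randn(2, 256, 14, 14)          # dummy backbone features
    spectrum = SpectrumBranch(256)(feats)
    fused = MultiScaleFusion(256)(feats, spectrum)
    print(fused.shape)                           # torch.Size([2, 256, 14, 14])
```

In this reading, the fused features would then be fed to the caption decoder so that attention is drawn toward local, person-specific regions; the actual ASP design may differ.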