Multimodality-guided Visual-Caption Semantic Enhancement

Published: 2024, Last Modified: 02 Mar 2026. Comput. Vis. Image Underst. 2024. CC BY-SA 4.0
Abstract: Highlights
• We build a new dataset with multimodal triples for multi-modality perception.
• A fusion framework enhances caption semantics by combining visual and auditory data.
• Extensive experiments validate our framework and confirm its effectiveness.
• ChatGPT generates syntactic structures to demonstrate framework availability.