Abstract: Highlights•We build a new dataset with multimodal triples for multi-modality perception.•A fusion framework enhances caption semantics by combining visual and auditory data.•Extensive experiments validate our framework and confirm its effectiveness.•ChatGPT generates syntactic structures to demonstrate framework availability.
Loading