M3ixup: A multi-modal data augmentation approach for image captioning

Published: 01 Jan 2025, Last Modified: 13 Nov 2024. Venue: Pattern Recognit. 2025. License: CC BY-SA 4.0
Abstract: Highlights
• First data augmentation method in image captioning to fix non-discriminative captions.
• VFM creates challenging samples and helps the model focus on key visual details.
• SEM provides mixed language input, and OM offers mixed objective supervision.