Abstract: Highlights
• First data augmentation method in image captioning to fix non-discriminative captions.
• VFM creates challenging samples and helps the model focus on key visual details.
• SEM provides mixed language input, and OM offers mixed objective supervision.