Keywords: Data Augmentation, Vision Language Model, Long-tail Task
Abstract: Vision-Language Models (VLMs) have made significant strides in multimodal understanding tasks, yet their robustness degrades severely on the long-tail data distributions common in real-world settings, especially in high-stakes domains such as medical image analysis.
To address this challenge, we propose MontageAug, a compositional data augmentation approach designed specifically for long-tail vertical domains.
It strategically composes images (in particular, pairing head and tail classes) to construct a novel visual scene (a montage image) and simultaneously generates an exactly matching compositional text description.
This method guarantees the semantic fidelity of the augmented samples by construction, while alleviating the long-tail data problem by creating information-rich hard positive samples.
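The core composition step can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the paper's implementation: it assumes a fixed 2x2 grid of equally sized images and a simple position-prefixed caption template, whereas the actual head/tail sampling strategy and caption generation are design choices of MontageAug.

```python
import numpy as np

# Fixed tile positions for a 2x2 montage (hypothetical layout choice).
POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right"]

def montage_augment(samples):
    """Compose four (image, caption) pairs into one montage image plus a
    compositional caption describing every tile.

    samples: list of four (H, W, 3) uint8 arrays, each paired with a caption;
             all images are assumed to share the same height and width.
    """
    assert len(samples) == 4, "this sketch assumes a 2x2 grid"
    imgs = [img for img, _ in samples]
    # Stitch tiles: concatenate each row horizontally, then stack rows.
    top = np.concatenate([imgs[0], imgs[1]], axis=1)
    bottom = np.concatenate([imgs[2], imgs[3]], axis=1)
    montage = np.concatenate([top, bottom], axis=0)
    # Build the matching compositional caption from the per-tile captions,
    # so text and image stay semantically aligned by construction.
    caption = "; ".join(
        f"{pos}: {cap}" for pos, (_, cap) in zip(POSITIONS, samples)
    )
    return montage, caption
```

Because the caption is assembled from the source captions themselves, the augmented pair cannot drift semantically, which is the property the abstract calls semantic fidelity.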
We conducted extensive experimental validation on a model based on the InternVL architecture using ophthalmic medical benchmarks.
The results show that MontageAug significantly enhances the model's recognition accuracy and generalization on tail classes, achieving state-of-the-art (SOTA) results that surpass existing augmentation methods on several benchmarks.
Furthermore, to explore the approach's extensibility, we validated it on Mathematical Expression Recognition (MER), achieving consistent improvements.
Our work ultimately demonstrates that MontageAug, as an efficient, low-cost, and semantics-preserving VLM augmentation strategy, holds practical value in solving the long-tail problem in specialized domains.
We plan to open-source our code, benchmark data, and models upon paper acceptance.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11887