Keywords: LVLMs, Long-Tail Issue, Data Synthesis
TL;DR: We propose an Adaptive Data Refinement Framework (ADR) to analyze and address long-tail data issues in LVLMs, improving model performance without increasing data volume.
Abstract: Recently, Large Vision-Language Models (LVLMs) have made significant progress, seamlessly integrating the visual comprehension capabilities of vision encoders with the language generation strengths of language models (LMs). Despite the success of LVLMs, their training and alignment data suffer from the $\textit{Long-Tail (LT)}$ problem: a highly imbalanced distribution with a large number of tail (minority) instances. A significant amount of research has focused on mitigating LT through data adjustment or network restructuring; however, efforts targeting generative LVLMs remain limited. In this paper, we present an in-depth analysis of the LT issues persisting in LVLMs' training data and characterize the data distribution from four perspectives, covering both visual and language aspects. To mitigate these challenges, we propose an $\textbf{A}$daptive $\textbf{D}$ata $\textbf{R}$efinement Framework ($\textbf{ADR}$), which consists of two stages: $\textbf{D}$ata $\textbf{R}$ebalancing (DR) and $\textbf{D}$ata $\textbf{S}$ynthesis (DS). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage the latent representations of scarce images to adaptively supplement the underrepresented portions. To validate the effectiveness of our approach, we conduct experiments on a series of comprehensive benchmarks, including GPT-assisted evaluations, to assess the overall performance changes introduced by our method. Through comprehensive evaluations, ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 by $\textbf{2.62\%}$ (relative) across 10 benchmarks, without increasing the training data volume.
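The DR stage described above rebalances redundant data according to entity distributions. A minimal sketch of one plausible interpretation, capping head-entity frequency so tail entities are never dropped; the function name `rebalance_by_entity`, the `target_ratio` parameter, and the per-sample `(entity, payload)` representation are illustrative assumptions, not the paper's actual implementation:

```python
import random
from collections import Counter

def rebalance_by_entity(samples, target_ratio=0.5, seed=0):
    """Downsample head entities by capping each entity's count at
    target_ratio of the most frequent entity's count (hypothetical
    stand-in for the paper's adaptive rebalancing).

    samples: list of (entity, payload) pairs.
    """
    counts = Counter(entity for entity, _ in samples)
    cap = max(1, int(max(counts.values()) * target_ratio))
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)  # avoid order bias when dropping excess items
    kept, seen = [], Counter()
    for entity, payload in shuffled:
        if seen[entity] < cap:  # tail entities (count <= cap) are kept whole
            kept.append((entity, payload))
            seen[entity] += 1
    return kept
```

For example, with ten "dog" samples and one "axolotl" sample and `target_ratio=0.5`, the cap is five, so the head entity is halved while the tail entity is fully retained.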
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2832