CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning

Published: 01 Jan 2024, Last Modified: 16 May 2025, CoRR 2024, CC BY-SA 4.0
Abstract: Visual instruction tuning is an important training stage for large multimodal models. Nevertheless, when multiple visual tasks are learned simultaneously, this approach may lead to suboptimal and imbalanced overall performance due to latent knowledge conflicts across tasks. To mitigate this issue, we introduce a novel Comprehensive Task Balancing (CoTBal) algorithm tailored for multi-task visual instruction tuning. To our knowledge, this is the first work to explore multi-task optimization in visual instruction tuning. Specifically, we consider two critical dimensions for task balancing: (1) Inter-Task Contribution, which captures how learning one task can enhance performance on others owing to overlapping knowledge domains across tasks, and (2) Intra-Task Difficulty, which indicates the inherent learning difficulty of a single task. We quantify both dimensions with performance-based metrics and achieve comprehensive task balancing by assigning greater weight to tasks that offer substantial contributions to others, receive minimal contributions from others, and present high learning difficulties. Extensive experiments on three benchmarks demonstrate that our CoTBal algorithm yields superior and more balanced overall performance in multi-task visual instruction tuning.
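The weighting rule described in the abstract can be illustrated with a toy sketch. The abstract does not state the actual metrics or formula, so everything below is an assumption made for illustration: the function name `balance_weights`, the `contribution` matrix and `difficulty` vector, the additive score, and the softmax normalization are all hypothetical and should not be read as the CoTBal algorithm itself.

```python
import math


def balance_weights(contribution, difficulty, temperature=1.0):
    """Toy task weights (illustrative only, not the paper's formula).

    contribution[i][j]: hypothetical score for how much learning task i
        helps task j (diagonal entries are ignored).
    difficulty[i]: hypothetical score for how hard task i is to learn.
    """
    n = len(difficulty)
    scores = []
    for i in range(n):
        # Contribution that task i offers to the other tasks.
        given = sum(contribution[i][j] for j in range(n) if j != i)
        # Contribution that task i receives from the other tasks.
        received = sum(contribution[j][i] for j in range(n) if j != i)
        # Assumed additive score: more given, less received, harder -> higher.
        scores.append(given - received + difficulty[i])

    # Numerically stable softmax so the weights are positive and sum to 1.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up numbers for three tasks of equal difficulty.
contribution = [
    [0.0, 0.5, 0.4],  # task 0 helps the others a lot
    [0.1, 0.0, 0.1],  # task 1 helps little and receives the most
    [0.2, 0.1, 0.0],
]
difficulty = [0.3, 0.3, 0.3]
weights = balance_weights(contribution, difficulty)
```

With these made-up inputs, task 0 receives the largest weight because it gives the most and receives the least, and task 1 receives the smallest. In a training loop, such weights would typically scale the per-task losses before they are summed.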