COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Published: 06 May 2025, Last Modified: 06 May 2025
Venue: SynData4CV
License: CC BY 4.0
Keywords: Vision-Language, Compositionality, Data for Efficient Learning
Abstract: Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require compositional capabilities, such as combining foundational capabilities like object recognition, spatial understanding, and counting. Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume but overlooks the compositional complexity of training examples, limiting their effectiveness in real-world scenarios. We propose COMPACT, COMPositional Atomic-to-complex Visual Capability Tuning that enables MLLMs to solve complex tasks by explicitly training them on compositions of foundational atomic capabilities. By generating training data with controlled compositional complexity and balanced distribution, COMPACT enables MLLMs to learn complex capabilities (k ≥ 1) more efficiently. With only 10% of the LLAVA-665K training data, COMPACT achieves 100.18% of the performance obtained using the full dataset. We observe that training with COMPACT on questions requiring up to k ≤ 3 capabilities exhibits strong generalization to complex multi-capability questions with k > 3 capabilities. COMPACT offers a scalable, data-efficient, atomic-to-complex visual compositional tuning recipe to improve on complex visual-language tasks.024
Supplementary Material: pdf
Submission Number: 56
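
The sketch below is a minimal, illustrative reading of the data-generation recipe the abstract describes: sampling combinations of atomic capabilities at controlled complexity levels k (kept balanced across levels) and turning each combination into a question-generation prompt. It is not the authors' pipeline; the capability names beyond object recognition, spatial understanding, and counting, as well as the function names and prompt template, are hypothetical.

import random
from itertools import combinations

# Atomic capabilities named in the abstract, plus two assumed examples.
ATOMIC_CAPABILITIES = [
    "object recognition",
    "spatial understanding",
    "counting",
    "attribute recognition",   # assumed example, not confirmed by the abstract
    "text/OCR understanding",  # assumed example, not confirmed by the abstract
]

def sample_compositions(max_k=3, per_k=4, seed=0):
    """Draw a balanced set of capability combinations for each complexity k.

    Returns (k, capabilities) pairs with `per_k` combinations at every
    complexity level 1..max_k, so no level dominates the training mix.
    """
    rng = random.Random(seed)
    sampled = []
    for k in range(1, max_k + 1):
        pool = list(combinations(ATOMIC_CAPABILITIES, k))
        rng.shuffle(pool)
        sampled.extend((k, combo) for combo in pool[:per_k])
    return sampled

def to_generation_prompt(capabilities, image_id):
    """Format a (hypothetical) instruction for a QA-generator model that must
    produce a question jointly requiring every listed capability."""
    caps = ", ".join(capabilities)
    return (f"For image {image_id}, write one question whose answer requires "
            f"using ALL of these capabilities together: {caps}.")

if __name__ == "__main__":
    for k, combo in sample_compositions():
        print(f"k={k}: {to_generation_prompt(combo, image_id='0001')}")

Under these assumptions, capping max_k at 3 mirrors the abstract's finding that training on k ≤ 3 compositions already generalizes to questions requiring more than three capabilities.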
