Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Multimodal Large Language Model, Multi-Grained Annotations, Fine-Grained Annotations, Concept Annotations, Vision-Language Learning
TL;DR: We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC). MMGiC helps MLLMs better learn concepts and use them to align vision and language at multiple granularities.
Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language tasks by pre-training solely on coarse-grained concept annotations (e.g., image captions). We hypothesize that integrating fine-grained concept annotations (e.g., object labels and object regions) will further improve performance, as the two granularities of data complement each other in the breadth and depth of concept representation. We introduce a new dataset featuring Multimodal Multi-Grained Concept annotations (MMGiC) for MLLMs. In constructing MMGiC, we explore the impact of different data recipes on multimodal comprehension and generation. Our analyses reveal that multi-grained concept annotations integrate and complement each other under our structured template and autoregressive discrete framework. We show that multi-grained concepts indeed help MLLMs better locate and learn concepts, aligning vision and language at multiple granularities. We further validate our hypothesis by investigating the comparison and collaboration between MMGiC and image-caption data on 12 multimodal comprehension and generation benchmarks; for example, their appropriate combination achieves 3.95% and 2.34% accuracy improvements on POPE and SEED-Bench, respectively. Code, data, and models will be made openly available.
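To make the idea of multi-grained annotations concrete, the sketch below shows one way an image sample carrying both a coarse-grained caption and fine-grained object labels with regions could be serialized into a single text sequence for autoregressive training. This is a minimal illustration, not the authors' released code: the field names, the template wording, and the normalized bounding-box format are assumptions rather than the actual MMGiC schema.

```python
# Minimal sketch: flattening multi-grained concept annotations into one
# training string. Field names and template tokens are illustrative only.

def build_multigrained_sequence(sample: dict) -> str:
    """Combine a coarse caption with fine-grained object concepts in one sequence."""
    parts = [f"<image> Caption: {sample['caption']}"]        # coarse-grained: image caption
    for obj in sample.get("objects", []):                     # fine-grained: per-object concepts
        x1, y1, x2, y2 = obj["box"]                           # assumed normalized [0, 1] coordinates
        parts.append(
            f"Object: {obj['label']} at <box>({x1:.2f},{y1:.2f}),({x2:.2f},{y2:.2f})</box>"
        )
    return " ".join(parts)

example = {
    "caption": "A dog chasing a frisbee on the grass.",
    "objects": [
        {"label": "dog", "box": (0.12, 0.40, 0.55, 0.92)},
        {"label": "frisbee", "box": (0.60, 0.25, 0.74, 0.38)},
    ],
}
print(build_multigrained_sequence(example))
```

Under such a structured template, coarse and fine annotations share one token stream, which is what allows an autoregressive MLLM to learn both granularities jointly.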
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5424