From Training-Free to Adaptive: Empirical Insights into MLLMs' Understanding of Detection Information
Keywords: Multimodal Large Language Models, Object Detection
TL;DR: This paper shows that fine-tuning MLLMs on textual detection information boosts performance over training-free methods and retains its gains even when the deployed detection model is replaced, highlighting the benefits of adaptive training for multimodal understanding.
Abstract: Despite the impressive capabilities of Multimodal Large Language Models (MLLMs) in integrating text and image modalities, challenges remain in accurately interpreting detailed visual elements. Fortunately, vision detection models excel at recognizing fine-grained image details and are therefore increasingly deployed by researchers to enhance the abilities of MLLMs. Among the feasible strategies, infusing detection information in text format is simple and effective. However, most studies apply this method in a training-free manner, and there is limited research on the effects of adaptive training, which has great potential for helping LLMs better comprehend the special input and discard irrelevant information. In this paper, we address the key research question: How does training influence MLLMs' understanding of infused textual detection information? We systematically conduct experiments with numerous representative models to explore the performance implications of training-free, retraining, and fine-tuning strategies when infusing textual detection information into MLLMs. Additionally, we investigate the impact of training on the original abilities of MLLMs, as well as the interchangeability of detection models. We find that fine-tuning the pre-trained MLLM to adapt to textual detection information yields better results than the training-free and retraining strategies, with the fine-tuned MLLM outperforming the training-free MLLM by 6.71\% across 10 widely recognized benchmarks. Moreover, fine-tuning allows the MLLM to maintain its performance improvements even after the deployed detection model is replaced, indicating that training enables the MLLM to better understand the specially formatted textual information. We release our code to facilitate further exploration of fusion strategies for vision detection models and of improving the fine-grained multimodal capabilities of MLLMs.
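For readers unfamiliar with the fusion strategy studied in the abstract, the sketch below illustrates one possible way detector outputs could be serialized into text and injected into an MLLM's prompt. The data fields, coordinate convention, and prompt template are illustrative assumptions for exposition, not the paper's released implementation; under the fine-tuning strategy, the MLLM would additionally be trained on inputs of this form.

```python
# Illustrative sketch only: the serialization format, coordinate convention,
# and prompt template are assumptions, not the paper's released code.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str                           # category predicted by the vision detector
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
    score: float                         # detector confidence


def detections_to_text(detections: List[Detection], max_objects: int = 20) -> str:
    """Serialize detector outputs into a plain-text block the MLLM can read."""
    kept = sorted(detections, key=lambda d: d.score, reverse=True)[:max_objects]
    lines = [
        f"- {d.label}: box=({d.bbox[0]:.2f}, {d.bbox[1]:.2f}, "
        f"{d.bbox[2]:.2f}, {d.bbox[3]:.2f}), confidence={d.score:.2f}"
        for d in kept
    ]
    return "Detected objects:\n" + "\n".join(lines)


def build_prompt(question: str, detections: List[Detection]) -> str:
    """Prepend the textual detection block to the user's question."""
    return f"{detections_to_text(detections)}\n\nQuestion: {question}"


if __name__ == "__main__":
    dets = [
        Detection("dog", (0.12, 0.35, 0.48, 0.90), 0.93),
        Detection("frisbee", (0.55, 0.20, 0.68, 0.31), 0.88),
    ]
    print(build_prompt("What is the dog about to catch?", dets))
```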
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8909