Try Before You Buy: Solving Multi-Model Complex Tasks by Model Competitions

Published: 01 Jan 2025 · Last Modified: 12 Nov 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Multi-modal large language models (MLLMs) extend large language models (LLMs) with the ability to reason over multi-modal data. When handling complex tasks, current MLLM workflows typically use an LLM to decompose the task into subtasks, heuristically bind each subtask to a specific pre-trained model that produces a subtask result, and finally integrate all the results into the final response. However, heuristically binding one model to one subtask may yield an unsatisfactory subtask result, thereby degrading overall performance. We therefore propose CompeMLLM, which introduces a method for dynamically orchestrating the workflow: instead of statically binding a model to each subtask, it lets different models compete on the same subtask. By dynamically integrating the results from diverse models, it selects the optimal result for each subtask and thus improves the overall performance of the MLLM. Concretely, given a complex task, CompeMLLM first decomposes it into subtasks; it then runs multiple pre-trained models on each subtask in parallel, evaluates the competing results with an ensemble-learning strategy to choose the optimal one, and finally integrates these optimal results into a complete workflow, yielding the best overall performance. We conducted extensive experiments with six advanced MLLMs as baselines across seven diverse datasets. The results show that CompeMLLM achieves significantly improved performance on all datasets, demonstrating its effectiveness.
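
To make the decompose-compete-select-integrate loop concrete, here is a minimal Python sketch of the orchestration pattern the abstract describes. Every name in it (decompose, CANDIDATE_MODELS, JUDGES, compete, integrate) is a hypothetical stand-in of our own, not the paper's actual API; the judges and models are placeholder callables.

```python
# Hypothetical sketch of a compete-then-select workflow, loosely following
# the abstract. Models and judges are placeholder callables, not real MLLMs.
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical candidate pool: several models compete on each subtask type.
CANDIDATE_MODELS = {
    "caption": [lambda x: f"caption-A({x})", lambda x: f"caption-B({x})"],
    "vqa":     [lambda x: f"vqa-A({x})", lambda x: f"vqa-B({x})"],
}

# Hypothetical judge ensemble: each judge scores a candidate result, and the
# ensemble score is the mean, standing in for the paper's evaluation step.
JUDGES = [
    lambda subtask, result: float(len(result)),        # placeholder heuristic
    lambda subtask, result: float(result.count("A")),  # placeholder heuristic
]

def decompose(task: str) -> list[tuple[str, str]]:
    """Stand-in for the LLM-based decomposition into (subtask, payload)."""
    return [("caption", task), ("vqa", task)]

def compete(subtask: str, payload: str) -> str:
    """Run all candidate models in parallel and keep the best-scored result."""
    models = CANDIDATE_MODELS[subtask]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda m: m(payload), models))
    # Ensemble evaluation: average the judges' scores for each candidate.
    scored = [(mean(j(subtask, r) for j in JUDGES), r) for r in results]
    return max(scored)[1]

def integrate(results: list[str]) -> str:
    """Stand-in for composing optimal subtask results into a final response."""
    return " | ".join(results)

def compemllm(task: str) -> str:
    subtasks = decompose(task)
    best = [compete(name, payload) for name, payload in subtasks]
    return integrate(best)

print(compemllm("describe-this-image"))
```

The key design point this sketch illustrates is that model selection happens per subtask at run time, based on scored outputs, rather than being fixed in advance by a heuristic model-to-subtask binding.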