Mordal: Automated Pretrained Model Selection for Vision Language Models

ICLR 2026 Conference Submission13598 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Multimodal Model, Vision Language Model, Model Selection

Abstract: Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest-growing category of multimodal models because of their many practical use cases, including healthcare, robotics, and accessibility. Although the VLMs in the literature demonstrate impressive visual capabilities on various benchmarks, they are handcrafted by human experts; there is no automated framework for creating task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates considered during the search and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal finds the best VLM for a given problem using $8.9\times$--$11.6\times$ fewer GPU hours than grid search. It also shows that Mordal achieves $1.2\times$--$3.3\times$ better performance than state-of-the-art model selection methods across a variety of tasks.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 13598
