An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models

Published: 09 Jun 2025, Last Modified: 14 Jul 2025
Venue: CODEML@ICML25
License: CC BY 4.0
Keywords: Deep Learning, Neural Networks, Artificial Intelligence, Large Language Models, Vision-Language Models, Multimodal Models, Action Models, Benchmark, Open-Source
TL;DR: Open-source benchmark, framework, and toolkit to adapt and evaluate multimodal models across vision, language, and action tasks.
Abstract: Recent innovations in multimodal action models represent a promising direction for developing general-purpose agentic systems that combine visual understanding, language comprehension, and action generation. We introduce Multinet, a novel, fully open-source benchmark and surrounding software ecosystem designed to rigorously evaluate and adapt models across the vision, language, and action domains. We establish standardized evaluation protocols for assessing vision-language models (VLMs) and vision-language-action models (VLAs), and provide open-source software to download the relevant data, models, and evaluations. Additionally, we provide a composite dataset with over a trillion tokens covering image captioning, visual question answering, commonsense reasoning, robotic control, digital gameplay, simulated locomotion and manipulation, and many other tasks. Our open-source benchmark, framework, toolkit, and evaluation harness have been used in downstream research on the limitations of VLA generalization.
Submission Number: 43
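
The abstract describes standardized evaluation protocols for VLMs and VLAs, but this page does not show the toolkit's actual API. The following is a minimal, hypothetical sketch of what such a protocol could look like: a model is run over a set of episodes and scored with a pluggable metric. All names here (EvalResult, evaluate, exact_match, the episode schema) are illustrative assumptions, not Multinet's real interface.

```python
# Hypothetical sketch of a standardized evaluation protocol.
# None of these names are taken from the Multinet codebase.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalResult:
    task: str
    metric: str
    score: float


def exact_match(prediction, reference) -> float:
    """Toy metric: 1.0 if the prediction equals the reference, else 0.0."""
    return float(prediction == reference)


def evaluate(model: Callable, episodes: Sequence[dict], metric_fn: Callable) -> EvalResult:
    """Run the model over (observation, reference) pairs and average the metric."""
    scores = [metric_fn(model(ep["observation"]), ep["reference"]) for ep in episodes]
    return EvalResult(
        task=episodes[0].get("task", "unknown"),
        metric=metric_fn.__name__,
        score=sum(scores) / len(scores),
    )


if __name__ == "__main__":
    # A trivial "model" that echoes its input, evaluated on two dummy episodes.
    episodes = [
        {"task": "vqa", "observation": "blue", "reference": "blue"},
        {"task": "vqa", "observation": "2 + 2 = ?", "reference": "4"},
    ]
    result = evaluate(lambda obs: obs, episodes, exact_match)
    print(result)  # EvalResult(task='vqa', metric='exact_match', score=0.5)
```

The point of a harness in this shape is that the same `evaluate` loop can score a captioning model, a VQA model, or a VLA policy, so long as each task supplies its own episode format and metric; whether Multinet factors its protocols this way is an assumption here.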