MARVIS: Modality Adaptive Reasoning over VISualizations

02 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: VLMs, LLMs, Tabular, t-SNE, Visualization
TL;DR: We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables even small vision-language models to solve classification and regression tasks on any data modality with high accuracy.
Abstract: Predictive applications of machine learning often rely on small (sub-1B-parameter) specialized models tuned to particular domains or modalities. Such models often achieve excellent performance, but lack flexibility. LLMs and VLMs offer versatility, but typically underperform specialized predictors, especially on non-traditional modalities and long-tail domains, and introduce risks of data exposure. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a training-free method that enables small vision-language models to solve predictive tasks on any data modality with high accuracy, without exposing private data to the VLM. MARVIS transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to interpret the visualizations and use them to make accurate predictions. MARVIS achieves competitive performance on vision, audio, biological, and tabular domains using a single 3B-parameter model, beating Gemini 2.0 by 16% on average. MARVIS drastically reduces the gap between LLM/VLM approaches and specialized domain-specific methods, without exposing sensitive data or requiring any domain-specific training. We open-source our code and datasets at https://anonymous.4open.science/r/marvis-6F54
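For concreteness, here is a minimal sketch of the pipeline the abstract describes: project precomputed embeddings to 2-D with t-SNE, render the projection as a labeled scatter plot, and ask a VLM to classify a highlighted query point. This is an illustration under assumptions, not the paper's released implementation; `render_embedding_plot` is an illustrative name, and `query_vlm` is a hypothetical stand-in for whichever VLM API is used.

```python
# Sketch of a MARVIS-style pipeline: embeddings -> t-SNE plot -> VLM query.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def render_embedding_plot(embeddings, labels, query_idx, path="marvis_plot.png"):
    """Project embeddings to 2-D with t-SNE and save a labeled scatter plot."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    fig, ax = plt.subplots(figsize=(6, 6))
    for label in np.unique(labels):
        mask = labels == label
        ax.scatter(coords[mask, 0], coords[mask, 1], s=12, label=f"class {label}")
    # Highlight the query point the VLM will be asked to classify.
    ax.scatter(*coords[query_idx], s=120, marker="*", c="black", label="query")
    ax.legend()
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path

def query_vlm(image_path, prompt):
    """Hypothetical VLM call; replace with an actual vision-language model API."""
    raise NotImplementedError

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings: outputs of any frozen domain encoder would go here.
    embeddings = rng.normal(size=(200, 64))
    labels = rng.integers(0, 3, size=200)
    image = render_embedding_plot(embeddings, labels, query_idx=0)
    # answer = query_vlm(image, "Which class does the starred point belong to?")
```

Note that only the rendered image reaches the VLM, which is how the raw data stays private: the model reasons over spatial structure in the plot rather than over the underlying records.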
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1120