R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A test-time method that dynamically adjusts Mixture-of-Experts routing weights to boost multimodal model performance without any retraining
Abstract: In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the powerful reasoning capabilities of large language models (LLMs), limiting LMMs' performance on challenging downstream tasks. This weakness has recently been mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides the rich, multi-granularity, and diverse representations required by different downstream tasks. The performance of a multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, "**R**e-**R**outing in **T**est-**T**ime (R2-T2)", that locally optimizes the vector of routing weights at test time by moving it toward the vectors of correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and significantly improves state-of-the-art LMMs' performance on challenging multimodal benchmarks across diverse tasks, without training any parameters in the base model. Our code can be accessed via the link below.
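The abstract describes the core mechanism at a high level: a test sample's routing-weight vector is pulled toward the routing weights of nearby, correctly predicted reference samples. Below is a minimal sketch of that idea, assuming a held-out reference set with cached embeddings and routing weights; the names (`rerouted_weights`, `ref_emb`, `ref_routing`), the Gaussian kernel, and the step size are illustrative assumptions, not the released R2-T2 implementation, which provides three strategies with different objectives and neighbor-search spaces.

```python
# Hypothetical sketch of the re-routing idea described in the abstract:
# move a test sample's routing-weight vector toward the routing weights of
# correctly predicted reference samples in its neighborhood. All names and
# hyperparameters here are illustrative, not the paper's actual code.
import numpy as np

def rerouted_weights(test_emb, test_routing, ref_emb, ref_routing,
                     k=5, bandwidth=1.0, step=0.5, n_steps=3):
    """Return re-routed expert weights for one test sample.

    test_emb     : (d,)   embedding of the test sample
    test_routing : (E,)   router's original weights over E experts
    ref_emb      : (N, d) embeddings of correctly predicted reference samples
    ref_routing  : (N, E) their routing-weight vectors
    """
    r = test_routing.copy()
    # k nearest correctly-predicted neighbors in embedding space
    dists = np.linalg.norm(ref_emb - test_emb, axis=1)
    nn = np.argsort(dists)[:k]
    # Gaussian-kernel weights: closer neighbors pull harder
    kernel = np.exp(-(dists[nn] ** 2) / (2 * bandwidth ** 2))
    kernel /= kernel.sum()
    target = kernel @ ref_routing[nn]   # kernel-weighted neighbor routing
    for _ in range(n_steps):
        r = r + step * (target - r)     # move toward the neighborhood target
    r = np.clip(r, 1e-8, None)
    return r / r.sum()                  # keep weights on the probability simplex

# Toy usage with random data (4 experts, 16-dim embeddings, 100 references)
rng = np.random.default_rng(0)
w = rerouted_weights(rng.normal(size=16),
                     rng.dirichlet(np.ones(4)),
                     rng.normal(size=(100, 16)),
                     rng.dirichlet(np.ones(4), size=100))
print(w, w.sum())
```

In this sketch the update interpolates toward a single kernel-weighted target; the paper's strategies may instead optimize an explicit objective over the neighborhood, but the test-time, training-free character is the same.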
Lay Summary: Most AI assistants struggle to choose the right “tools” when answering questions about images, leading to mistakes (Problem). We solved this by letting the system look at past examples when it gets a new image-and-question task and borrow the expert “recipe” that worked before (Solution). Without changing or retraining the AI, our approach automatically picks the best experts—like a chef following a trusted recipe—to understand objects, read text, or judge spatial relationships. This means the model gives more accurate answers on the spot, even for tough questions. In practice, our method helps AI assistants become smarter about which visual skills to use each time they face a new task. People and companies will benefit because they can get better image-based insights without the cost of rebuilding or retraining large AI systems (Impact).
Link To Code: https://github.com/tianyi-lab/R2-T2
Primary Area: General Machine Learning->Transfer, Multitask and Meta-learning
Keywords: mixture-of-experts, test-time optimization, multimodal models
Submission Number: 411