Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging

Published: 08 Jul 2025, Last Modified: 26 Aug 2025, COLM 2025, CC BY-NC-SA 4.0
Keywords: test-time training, model merging, mixture of experts, language modeling, local learning, transductive learning
TL;DR: We propose Test-Time Model Merging (TTMM), which approaches the performance of Test-Time Training (TTT) with almost no test-time overhead.
Abstract: Mixture of experts (MoE) models are a promising approach to increasing model capacity without increasing inference cost, and are core components of many state-of-the-art language models. However, current MoE models typically use only a few experts due to prohibitive training and inference costs. We propose _**T**est-**T**ime **M**odel **M**erging_ (TTMM), which scales the MoE paradigm to orders of magnitude more experts and uses model merging to avoid almost all test-time overhead. We show that TTMM is an approximation of test-time training (TTT), which fine-tunes an expert model for each prediction task, i.e., prompt. TTT has recently been shown to significantly improve language models, but it is computationally expensive. We find that the performance of TTMM improves with more experts and approaches the performance of TTT. Moreover, we find that with a 1B parameter base model, _TTMM is more than $100\times$ faster than TTT_ at test time by amortizing the cost of TTT at train time. Thus, TTMM offers a promising, cost-effective approach to scaling test-time training.
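To make the idea concrete, below is a minimal sketch of test-time model merging. It assumes experts are stored as parameter deltas (e.g., LoRA-style updates) with one centroid embedding each; at test time the prompt embedding selects a few nearby experts whose deltas are averaged into the base weights, so inference then runs at the cost of a single model. The function names (`select_experts`, `merge_parameters`) and the softmax weighting over cosine similarities are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of test-time model merging: pick experts near the prompt embedding,
# merge their parameter deltas into the base model, then run ordinary inference.
import numpy as np

def select_experts(prompt_emb, centroids, k=2):
    """Return indices and softmax weights of the k experts whose centroids
    are closest (by cosine similarity) to the prompt embedding."""
    sims = centroids @ prompt_emb / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(prompt_emb) + 1e-8
    )
    top = np.argsort(sims)[-k:]
    w = np.exp(sims[top] - sims[top].max())
    return top, w / w.sum()

def merge_parameters(base, expert_deltas, indices, weights):
    """Add the weight-averaged expert deltas to the base parameters."""
    merged = {name: p.copy() for name, p in base.items()}
    for idx, w in zip(indices, weights):
        for name, delta in expert_deltas[idx].items():
            merged[name] += w * delta
    return merged

# Toy example: 4 experts with 3-dim centroid embeddings and one weight matrix.
rng = np.random.default_rng(0)
base = {"layer.weight": rng.standard_normal((8, 8))}
experts = [{"layer.weight": 0.01 * rng.standard_normal((8, 8))} for _ in range(4)]
centroids = rng.standard_normal((4, 3))
prompt_emb = rng.standard_normal(3)

idx, w = select_experts(prompt_emb, centroids, k=2)
merged = merge_parameters(base, experts, idx, w)
# `merged` is a single set of weights: no per-token routing, no test-time fine-tuning.
```

The design point this sketch illustrates is that the expensive part (training many experts) happens offline, while the test-time cost reduces to an embedding lookup plus a weighted parameter average.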
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1013