SRA-MoE: Output-Aware Selective Router Alignment for MoE Quantization

Published: 01 Jun 2026, Last Modified: 09 Jun 2026, Venue: AdaptFM Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Mixture-of-Experts, Quantization, LLM, Reasoning, Optimization
TL;DR: Selective Router Alignment (SRA) improves low-bit quantization of Mixture-of-Experts LLMs by aligning only tokens whose outputs change meaningfully, and it generally outperforms conventional router alignment in quantized MoE models.
Abstract: Mixture-of-Experts (MoE) architectures enable scalable large language models (LLMs), but their deployment remains memory-intensive, making quantization essential. However, quantizing MoE models introduces routing shifts, in which expert selection differs from that of the baseline model and can degrade performance. Existing router alignment methods uniformly minimize routing discrepancies across all tokens, implicitly treating all routing shifts as equally important. In this work, we show that routing shifts have a highly heterogeneous impact on model outputs: some substantially affect output behavior, while many others induce negligible output discrepancy despite large routing changes. Motivated by this observation, we propose Selective Router Alignment (SRA), an output-aware alignment strategy that prioritizes optimization on tokens exhibiting meaningful output discrepancy after quantization. Experiments across multiple MoE LLMs and reasoning benchmarks show that SRA generally improves over conventional router alignment. Our findings suggest that effective MoE router alignment depends not only on reducing routing shifts, but also on prioritizing those that meaningfully affect output behavior.
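The selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the discrepancy measure, the selection rule, or the alignment loss, so everything concrete here is an assumption made for illustration: mean squared error as the output discrepancy, a hard threshold for choosing tokens, and KL divergence between router distributions as the alignment loss. The function names are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions (assumed alignment loss)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def output_discrepancy(ref_out, quant_out):
    """Mean squared error between per-token outputs (assumed discrepancy measure)."""
    return sum((a - b) ** 2 for a, b in zip(ref_out, quant_out)) / len(ref_out)


def selective_alignment_loss(ref_logits, quant_logits, ref_outs, quant_outs, threshold):
    """Average the router KL loss over tokens whose output discrepancy exceeds
    `threshold` (an assumed hard selection rule). Returns (loss, selected indices)."""
    selected = [
        i
        for i, (r, q) in enumerate(zip(ref_outs, quant_outs))
        if output_discrepancy(r, q) > threshold
    ]
    if not selected:
        return 0.0, selected
    loss = sum(
        kl_divergence(softmax(ref_logits[i]), softmax(quant_logits[i]))
        for i in selected
    ) / len(selected)
    return loss, selected


# Token 0: large routing shift, identical outputs. Token 1: routing shift and changed outputs.
ref_logits = [[2.0, 0.0], [2.0, 0.0]]
quant_logits = [[0.0, 2.0], [1.0, 0.5]]
ref_outs = [[1.0, 1.0], [1.0, 1.0]]
quant_outs = [[1.0, 1.0], [2.0, 0.0]]

loss, selected = selective_alignment_loss(
    ref_logits, quant_logits, ref_outs, quant_outs, threshold=0.1
)
```

In this toy input only token 1 is selected, even though token 0 has the larger routing shift, because token 0's output is unchanged. A uniform alignment baseline corresponds to a threshold below every token's discrepancy, which selects all tokens.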
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 132