Expert-Consensus Modality Fusion Quantization for MoE Vision-Language Models

Lujun Li

Published: 31 Dec 2025, Last Modified: 26 Mar 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: The Mixture-of-Experts (MoE) architecture enables scalable Vision-Language Models (VLMs) by decomposing complex cross-modal tasks across specialized experts. Its inherent heterogeneity (divergent expert weight distributions, non-uniform activation patterns, and distinct vision and language token characteristics) poses significant challenges for Post-Training Quantization (PTQ). Existing PTQ methods either treat all experts uniformly or ignore cross-modal interactions, which degrades performance under low-bit quantization. To address this, we propose Expert-Consensus Modality Fusion Quantization (ECMFQ), a unified PTQ framework that harmonizes expert heterogeneity and cross-modal alignment for MoE VLMs. ECMFQ introduces three key components: 1) Expert Consensus Smoothing (ECS), a weighted smoothing objective that captures the diverse weight distributions of all potentially activated experts and is optimized via an outlier-robust search algorithm to derive a unified per-channel scaling matrix; 2) Modality-Affinity Consensus Selection (MACS), a calibration sample selection strategy that balances expert importance, cross-modal token affinity, and global activation patterns to build a robust consensus across modalities; 3) Fusion-Aware Quantization (FAQ), which integrates cross-modal fusion weights into the quantization process so that the critical interactions between vision and language tokens are preserved during compression. Evaluations across diverse models and benchmarks (Kimi-VL, Qwen3-VL, COCO-VL) show that ECMFQ consistently outperforms state-of-the-art PTQ methods. Under W4A8 (4-bit weights, 8-bit activations), ECMFQ maintains 98.7% of the full-precision accuracy on Qwen3-VL while reducing memory usage by 75%, and it achieves a 1.89% average accuracy gain on cross-modal retrieval tasks over the previous best method. As a compatible framework, ECMFQ lowers the barrier to efficient deployment of MoE VLMs across edge and cloud environments.
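
The idea behind ECS (a single per-channel scaling shared by all experts that may receive a token) can be illustrated with a minimal sketch. It assumes a SmoothQuant-style rescaling in which activations are divided by a per-input-channel vector and each expert's weights are multiplied by it, and it uses routing frequencies as the consensus weights. The function names, the abs-max statistics, and the plain grid search over `alpha` are all assumptions made for illustration; the abstract does not specify the actual ECS objective or its outlier-robust search, so this should not be read as the authors' implementation.

```python
import numpy as np

def quantize_sym(x, bits, axis):
    """Symmetric uniform fake-quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def consensus_scale(act_absmax, expert_weights, expert_freq, alpha):
    """Hypothetical consensus scale: one vector of shape (C_in,) for all experts."""
    # Per-input-channel abs-max of every expert weight, shape (E, C_in).
    w_absmax = np.stack([np.abs(w).max(axis=1) for w in expert_weights])
    p = expert_freq / expert_freq.sum()
    # Frequency-weighted consensus of the expert weight statistics (assumption).
    w_cons = (p[:, None] * w_absmax).sum(axis=0)
    s = act_absmax ** alpha / np.maximum(w_cons, 1e-8) ** (1.0 - alpha)
    return np.maximum(s, 1e-8)

def search_alpha(x, expert_weights, expert_freq, alphas=None):
    """Grid search for the alpha minimizing frequency-weighted W4A8 output error."""
    if alphas is None:
        alphas = np.linspace(0.1, 0.9, 9)
    act_absmax = np.abs(x).max(axis=0)
    p = expert_freq / expert_freq.sum()
    best_alpha, best_err, best_s = None, np.inf, None
    for alpha in alphas:
        s = consensus_scale(act_absmax, expert_weights, expert_freq, alpha)
        xq = quantize_sym(x / s, bits=8, axis=1)  # per-token 8-bit activations
        err = 0.0
        for pe, w in zip(p, expert_weights):
            # Per-output-channel 4-bit weights after applying the shared scale.
            wq = quantize_sym(w * s[:, None], bits=4, axis=0)
            err += pe * np.mean((x @ w - xq @ wq) ** 2)
        if err < best_err:
            best_alpha, best_err, best_s = float(alpha), float(err), s
    return best_alpha, best_err, best_s

# Toy data: 64 tokens, 16 input channels, 4 experts with 32 outputs each.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
x[:, 3] *= 20.0  # one outlier activation channel
experts = [rng.normal(size=(16, 32)) for _ in range(4)]
freq = np.array([0.5, 0.3, 0.15, 0.05])  # made-up routing frequencies

alpha, err, s = search_alpha(x, experts, freq)
```

Because the same vector `s` divides the activations and multiplies every expert's weights, the full-precision output `x @ w` is unchanged for each expert; only the quantization error depends on `s`. Using one shared vector rather than one per expert reflects the "unified per-channel scaling" wording in the abstract, since the tokens are scaled before routing decides which experts consume them.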