HybridSB-MoE: Dual-Domain Schrödinger Bridges with Scene-Adaptive Expert Routing for Speech Enhancement

ICLR 2026 Conference Submission 13365 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Speech Enhancement, Schrödinger Bridge, Mixture-of-Experts, Dual-domain Processing
Abstract: Single-domain generative speech enhancement methods fail to exploit complementary acoustic representations. Despite recent advances in Schr\"{o}dinger Bridge (SB) formulations, existing approaches remain constrained by homogeneous architectures and prohibitively high sampling costs. We propose \textbf{HybridSB-MoE}, a framework that integrates the SB with a heterogeneous mixture-of-experts (MoE) for parallel dual-domain processing. Our framework combines temporal-coherence modeling, via an enhanced SB in the waveform domain, with scene-adaptive spectral processing through five architecturally distinct experts (Home, Nature, Office, Transport, Public), selected automatically via sparse Top-$k$ routing without scene labels. Through trajectory regularization that combines optimal-transport and path-consistency terms, we reduce the number of required sampling steps from 40-50 to just 8 while maintaining enhancement quality. An uncertainty-aware fusion module unifies the complementary representations using calibrated weights derived from the epistemic (MoE) and aleatoric (SB) uncertainties. On the VoiceBank+DEMAND dataset, HybridSB-MoE achieves PESQ $3.88\pm0.25$ and STOI $0.96$, surpassing methods that require $5\times$ more sampling steps. Ablation studies confirm the necessity of each component: PESQ drops to 3.45 without the SB branch and to 3.25 without the MoE.
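
The abstract's routing and fusion steps admit a compact illustration. Below is a minimal PyTorch sketch, not the authors' implementation: the module names, dimensions, stand-in linear experts, placeholder SB output, and the inverse-variance fusion rule are all assumptions made for demonstration. It shows how sparse Top-$k$ gating can select a subset of five experts without scene labels, and how two branch outputs can be fused by uncertainty-derived weights.

    # Illustrative sketch only (not the paper's code): sparse Top-k routing over
    # five scene experts and an uncertainty-weighted fusion of two branches.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseTopKRouter(nn.Module):
        """Routes each utterance embedding to k of the num_experts experts."""
        def __init__(self, dim: int, num_experts: int = 5, k: int = 2):
            super().__init__()
            self.gate = nn.Linear(dim, num_experts)
            self.k = k

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            logits = self.gate(x)                              # (B, num_experts)
            topk_vals, topk_idx = logits.topk(self.k, dim=-1)
            masked = torch.full_like(logits, float("-inf"))
            masked.scatter_(-1, topk_idx, topk_vals)           # keep only top-k logits
            return F.softmax(masked, dim=-1)                   # zeros off the top-k

    def uncertainty_fusion(y_sb, y_moe, var_sb, var_moe, eps=1e-6):
        """Inverse-variance fusion of the SB (waveform) and MoE (spectral) outputs.
        A simple stand-in for the paper's calibrated uncertainty weighting."""
        w_sb, w_moe = 1.0 / (var_sb + eps), 1.0 / (var_moe + eps)
        return (w_sb * y_sb + w_moe * y_moe) / (w_sb + w_moe)

    if __name__ == "__main__":
        B, D, T = 4, 128, 16000                                # assumed sizes
        router = SparseTopKRouter(dim=D)
        experts = nn.ModuleList(nn.Linear(D, T) for _ in range(5))  # stand-ins
        emb = torch.randn(B, D)                                # utterance embedding
        gates = router(emb)                                    # (B, 5), 2 nonzero/row
        expert_out = torch.stack([e(emb) for e in experts], dim=1)  # (B, 5, T)
        y_moe = (gates.unsqueeze(-1) * expert_out).sum(dim=1)  # gated spectral branch
        y_sb = torch.randn(B, T)                               # placeholder SB branch
        var_sb = torch.rand(B, 1) + 0.1                        # aleatoric (SB)
        var_moe = torch.rand(B, 1) + 0.1                       # epistemic (MoE)
        y = uncertainty_fusion(y_sb, y_moe, var_sb, var_moe)
        print(y.shape)                                         # torch.Size([4, 16000])

In this sketch, the router zeroes out all but the top-$k$ logits before the softmax, so exactly $k$ experts receive nonzero weight per utterance; the fusion weights are plain inverse variances, a common simplification of calibrated uncertainty weighting.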
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13365