Filter Before Mixing: Per-Modality Denoising for Multimodal RL with Application to Health Management

Tsuyoshi Okita

Published: 26 May 2026, Last Modified: 26 May 2026 | Electronics, MDPI | Everyone | CC BY 4.0

Abstract: Multimodal reinforcement learning agents must fuse signals with vastly different noise profiles, yet existing architectures, whether monolithic ($\pi_0$, DreamerV3) or modular (MSDP, VTDexManip), allow noise from unreliable modalities to contaminate reliable ones at the point of fusion. We propose \emph{filter-before-mixing}: each modality's representation is independently refined by a per-modality Flow Matching module before spectral-domain fusion via a Fourier Neural Operator (FNO), with a residual gate ensuring that refinement is never harmful. The resulting architecture, \textbf{FreamerV1} (Filter-before-mixing dreamer), has 93M parameters (0.4M trainable). On MiniGrid, FreamerV1 reaches 87.7\% $\pm$ 8.2\% (3 seeds) at 5000 episodes, while the encoder-only baseline degrades to 78\% due to catastrophic forgetting. With OGM-GE (On-the-fly Gradient Modulation) for adaptive per-modality gate control, FreamerV1 achieves an 8.0\% relative improvement in success rate over manual tuning with halved seed-to-seed variance (3 seeds). On Crafter (no language modality), it achieves an 11.7\% relative improvement over DreamerV3 in the official Crafter score (geometric mean of 22 achievement success rates; 10 seeds). On PAMAP2 wearable sensors, where no pre-trained encoder exists, the foundation encoder achieves 2.4$\times$ higher reward and 16$\times$ lower variance than a vanilla MLP, confirming that the filter-before-mixing advantage grows with encoder noise.
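
The filter-before-mixing idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the function names `gated_refine` and `filter_before_mixing`, the sigmoid form of the gate, and the use of plain concatenation as a stand-in for the FNO fusion step. The `refine_fn` argument stands in for a per-modality Flow Matching module.

```python
import numpy as np

def gated_refine(z, refine_fn, gate_logit):
    """Blend a modality's features with a refined version of them.

    Hypothetical residual gate: gate = sigmoid(gate_logit) lies in (0, 1).
    As the gate approaches 0 the output approaches z, so the refinement
    can be switched off when it does not help.
    """
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    return z + gate * (refine_fn(z) - z)

def filter_before_mixing(features, refiners, gate_logits):
    """Refine each modality independently, then mix.

    Concatenation is only a placeholder for the spectral-domain fusion;
    the point is that each modality is filtered before any mixing occurs.
    """
    refined = [
        gated_refine(z, fn, g)
        for z, fn, g in zip(features, refiners, gate_logits)
    ]
    return np.concatenate(refined, axis=-1)
```

With a strongly negative `gate_logit` the modality passes through essentially unchanged, and with a strongly positive one the refined features replace the originals, which is one way to read the abstract's claim that the gate keeps refinement from being harmful.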