Filter Before Mixing: Per-Modality Denoising for Multimodal RL with Application to Health Management
Abstract: Multimodal reinforcement learning agents must fuse signals
with vastly different noise profiles---yet existing architectures,
whether monolithic ($\pi_0$, DreamerV3) or modular (MSDP,
VTDexManip), allow noise from unreliable modalities to
contaminate reliable ones at the point of fusion.
We propose \emph{filter-before-mixing}: each modality's
representation is independently refined by a per-modality
Flow Matching module before spectral-domain fusion via a
Fourier Neural Operator (FNO), with a residual gate ensuring
that refinement is never harmful.
The resulting architecture, \textbf{FreamerV1}
(Filter-before-mixing dreamer), has 93M parameters
(0.4M trainable).
On MiniGrid, FreamerV1 reaches 87.7\% $\pm$ 8.2\% (3 seeds)
at 5000 episodes, while the encoder-only baseline degrades to
78\% due to catastrophic forgetting.
\red{With OGM-GE (On-the-fly Gradient Modulation) for adaptive
per-modality gate control, FreamerV1 achieves an 8.0\% relative
improvement in success rate over manual tuning with halved
seed-to-seed variance (3 seeds).}
On Crafter (no language modality), it achieves
\red{an 11.7\% relative improvement over DreamerV3 in
the official Crafter score (geometric mean of 22
achievement success rates; 10 seeds).}
On PAMAP2 wearable sensors---where no pre-trained encoder
exists---the foundation encoder achieves 2.4$\times$ higher
reward and 16$\times$ lower variance than a vanilla MLP,
confirming that the filter-before-mixing advantage grows
with encoder noise.
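
The residual gate and the Crafter score mentioned above can be written out explicitly. The gate equation is a minimal sketch under stated assumptions, not necessarily the exact formulation used in FreamerV1: the symbols $z_m$ (encoder output for modality $m$), $f_m$ (that modality's Flow Matching refinement), $g_m$ (gate value), and $\mathcal{F}$ (the FNO fusion operator over $M$ modalities) are notation introduced here for illustration.
\[
  \tilde{z}_m = z_m + g_m \odot \bigl(f_m(z_m) - z_m\bigr),
  \qquad g_m \in [0,1]^d,
\]
\[
  h = \mathcal{F}\bigl(\tilde{z}_1, \dots, \tilde{z}_M\bigr).
\]
Under this assumed form, $g_m = 0$ recovers the unrefined encoder output $z_m$, so a gate that closes on an unhelpful refinement leaves the representation unchanged before fusion. The Crafter score, as defined by the benchmark, is a geometric mean computed in log space over the $N = 22$ achievement success rates $s_i \in [0, 100]$ (in percent):
\[
  S = \exp\!\Bigl(\frac{1}{N} \sum_{i=1}^{N} \ln(1 + s_i)\Bigr) - 1.
\]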