Keywords: Mixture-of-Experts (MoE), Pretrained Encoder, Diffusion Policy, Robotic Manipulation
Abstract: The integration of pretrained encoders with diffusion policies has emerged as a dominant paradigm for visual robotic manipulation. However, such policies still struggle to generalize across complex environments with varying factors such as lighting and surface textures.
To address this, we propose FAME, a framework that integrates a factor-aware mixture-of-experts (MoE) with a pretrained encoder to significantly enhance generalization to environmental variations. FAME involves a three-stage training process: (1) policy warmup, where a diffusion policy is trained on data from a standard environment using a frozen encoder; (2) factor-specific adapter training, where we separately train a series of lightweight adapters, inserted between the frozen encoder and the temporarily frozen policy, on customized datasets, each focusing on a distinct environmental variation; (3) joint fine-tuning, where we simultaneously train a central router and the warmed-up policy on a mixed dataset to handle multiple factors at once. We say FAME is ``factor-aware'' because the central router organizes the frozen factor-specific adapters into an MoE, enabling combinatorial generalization across multiple factors.
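The routing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the encoder/adapter dimensions, the linear adapter form, and the residual connection are all assumptions, since the abstract does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

# Hypothetical sizes; the paper's actual encoder and adapter shapes are not given here.
feat_dim, n_factors = 8, 3  # e.g. adapters for lighting, texture, background

# Stage 2: frozen factor-specific adapters, one lightweight linear map per factor.
adapters = [rng.normal(scale=0.1, size=(feat_dim, feat_dim)) for _ in range(n_factors)]

# Stage 3: central router mapping encoder features to mixture weights over adapters.
router_W = rng.normal(scale=0.1, size=(n_factors, feat_dim))

def fame_features(z):
    """Combine the frozen adapters as a factor-aware MoE over encoder features z."""
    w = softmax(router_W @ z)                              # per-factor gating weights
    z_adapted = sum(w[i] * (A @ z) for i, A in enumerate(adapters))
    return z + z_adapted                                   # residual link (an assumption)

z = rng.normal(size=feat_dim)  # stand-in for the frozen encoder's output feature
out = fame_features(z)
print(out.shape)  # (8,)
```

Because the gating weights form a convex combination, the router can blend several adapters at once, which is what allows a single forward pass to handle compositions of factors (e.g. changed lighting and a new texture together).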
Evaluations on the Meta-World benchmark with various environmental factors show that FAME significantly outperforms existing diffusion policy baselines. Furthermore, FAME demonstrates remarkable scaling properties as the number of demonstrations increases. We believe FAME provides an effective solution for achieving combinatorial generalization in visual robotic control tasks.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 15174