DeTaCH: Decoupling Tasks and Control via a Meta-Gradient Hypernetwork

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · Readers: Everyone · CC BY 4.0
Keywords: Language-Conditioned Policy Learning
Abstract: Current language-conditioned robotic policies suffer from a fundamental architectural bottleneck: when language instructions and visual observations are processed through shared representations, networks cannot distinguish between task specification and state perception, leading to policies that exploit spurious visual correlations rather than grounded language semantics. We identify this phenomenon as modality confounding, where gradient interference and entangled representations prevent proper decomposition of task knowledge from perceptual processing. To address this limitation, we propose DeTaCH, which reconceptualizes language not as an input to be fused with vision (state), but as a meta-specification that generates parameters of task-specific visuomotor policies. Through a two-stage hypernetwork architecture combining semantic initialization with iterative neural gradient estimation, DeTaCH achieves explicit decoupling between language understanding and visual control. Experiments across 90 language-conditioned tasks in LIBERO and 45 tasks in Meta-World demonstrate that DeTaCH improves success rates to 51.4% and 92.2%, respectively, with particularly strong gains on complex, long-horizon tasks where modality confounding is most severe. The generated parameter manifold also exhibits semantic structure, enabling 25% better few-shot adaptation than baselines with only three demonstrations. Our results suggest that explicit architectural separation of heterogeneous modalities may be essential for the generalization of multi-task manipulation policies.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 10803
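
Illustrative sketch (not from the submission): the abstract describes language generating the parameters of a visuomotor policy in two stages, a semantic initialization followed by iterative neural gradient estimation. The toy Python below shows one way such a pipeline could be wired. Everything in it is an assumption made for illustration: the class name `HyperPolicySketch`, the linear maps, the `tanh`-bounded update, the step count, and the single-layer policy are hypothetical and do not come from the paper.

```python
import math
import random


def linear(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class HyperPolicySketch:
    """Hypothetical two-stage generator: language embedding -> policy parameters."""

    def __init__(self, lang_dim, obs_dim, act_dim, steps=3, step_size=0.1, seed=0):
        rng = random.Random(seed)
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.steps, self.step_size = steps, step_size
        n_params = act_dim * obs_dim  # the generated policy is one linear layer
        # Stage 1 (assumed form): map the language embedding to initial parameters.
        self.init_map = random_matrix(n_params, lang_dim, rng)
        # Stage 2 (assumed form): estimate an update direction from [theta, language].
        self.grad_map = random_matrix(n_params, n_params + lang_dim, rng)

    def generate(self, lang_emb):
        """Produce policy parameters from the language embedding alone."""
        theta = linear(self.init_map, lang_emb)
        for _ in range(self.steps):
            direction = linear(self.grad_map, theta + lang_emb)  # list concatenation
            theta = [t - self.step_size * math.tanh(d) for t, d in zip(theta, direction)]
        return theta

    def act(self, theta, obs):
        """Run the generated policy; language does not enter this function."""
        rows = [theta[i * self.obs_dim:(i + 1) * self.obs_dim] for i in range(self.act_dim)]
        return linear(rows, obs)


model = HyperPolicySketch(lang_dim=4, obs_dim=3, act_dim=2)
theta_a = model.generate([1.0, 0.0, 0.0, 0.0])  # stand-in embedding for task A
theta_b = model.generate([0.0, 1.0, 0.0, 0.0])  # stand-in embedding for task B
action = model.act(theta_a, [0.5, -0.2, 0.1])
```

The point of the sketch is the data flow: the instruction embedding only influences `generate`, while `act` sees just the observation and the generated weights, which is the kind of separation between task specification and state perception that the abstract argues for.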