CRUX: Counterfactual Multi-Hypothesis Interpretation for Robust Embodied Agents

06 Dec 2025 (modified: 07 Dec 2025) | NeurIPS 2025 Workshop FMEA Submission | CC BY 4.0
Keywords: Embodied agents, Modular reinforcement learning, Temporal logic task planning, Robust policy selection, Counterfactual multi-hypothesis rollouts
TL;DR: CRUX is a modular embodied-agent pipeline that uses multi-hypothesis goal interpretation and counterfactual rollouts to choose robust plans, improving noisy performance while matching clean success.
Abstract: Embodied environments are a standard way to probe the reasoning abilities of large language models (LLMs) and other foundation models. Most systems follow the same pattern: a language front-end interprets the instruction, a subgoal module decomposes it, a planner proposes an action sequence, and a world model predicts the consequences. Yet performance is often reported as a single task-level success rate, which hides where the agent fails and how brittle it is under noise. We take a modular view. We write down a formal model of a four-stage embodied pipeline (goal interpretation, subgoal decomposition, planning, and world modeling) and show that, under a mild monotonicity assumption, the overall task failure probability is bounded by the sum of the error rates of the individual modules. We refine this picture for tree-structured subgoal decompositions and show how decomposition errors scale with the number of leaves. On top of this framework we build CRUX, a variant that treats interpretation and decomposition as multi-hypothesis problems. Rather than committing to a single interpretation of the language instruction, CRUX maintains a small set of candidate formal tasks and decompositions, plans for each, and uses a world model to run counterfactual rollouts under execution noise. The candidate with the highest estimated robust success is then selected and executed. We prove a finite-sample guarantee for this selection rule: with a reasonable number of rollouts, the robust performance of the chosen candidate is close to that of the best candidate in the set. To ground the theory, we implement a compact CRUX prototype in a stochastic gridworld. In this setting, both CRUX and a single-interpretation baseline achieve perfect success in a clean environment, but under action noise and random start perturbations CRUX consistently attains higher robust success. In a moderate noise regime, the 95% confidence intervals for robust success do not overlap between CRUX and the baseline.
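The selection rule described in the abstract (estimate each candidate's success under noisy rollouts, then execute the candidate with the highest estimate) can be illustrated with a minimal sketch. This is not the authors' implementation: the 3x3 grid, the pit cell, the open-loop string plans, and the names `noisy_rollout` and `select_robust` are all hypothetical stand-ins chosen for illustration, and the toy simulator plays the role of the world model.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy dynamics: unit moves on a small grid (illustration only).
MOVES = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}


def noisy_rollout(plan: str, slip: float, rng: random.Random,
                  start: Tuple[int, int] = (0, 0),
                  goal: Tuple[int, int] = (2, 0),
                  pit: Tuple[int, int] = (1, 1),
                  size: int = 3) -> bool:
    """Run one simulated rollout of an open-loop plan under action noise.

    With probability `slip`, each action is replaced by a uniformly random
    move. Stepping into the pit fails immediately; the rollout succeeds
    only if the agent ends on the goal cell.
    """
    x, y = start
    for action in plan:
        if rng.random() < slip:
            action = rng.choice("UDLR")
        dx, dy = MOVES[action]
        # Walls clamp the agent inside the grid.
        x = min(max(x + dx, 0), size - 1)
        y = min(max(y + dy, 0), size - 1)
        if (x, y) == pit:
            return False
    return (x, y) == goal


def select_robust(candidates: Sequence[str],
                  simulate: Callable[[str, random.Random], bool],
                  n_rollouts: int,
                  rng: random.Random) -> Tuple[int, List[float]]:
    """Return the index of the candidate with the highest empirical
    success rate over `n_rollouts` simulated rollouts, plus all estimates.
    Ties are broken in favor of the earliest candidate."""
    estimates = [
        sum(simulate(c, rng) for _ in range(n_rollouts)) / n_rollouts
        for c in candidates
    ]
    best = max(range(len(candidates)), key=estimates.__getitem__)
    return best, estimates


if __name__ == "__main__":
    rng = random.Random(0)
    plans = ["RR", "RRRR"]  # two hypothetical candidate plans
    best, est = select_robust(
        plans, lambda p, r: noisy_rollout(p, 0.2, r), 500, rng)
    print("estimates:", est, "selected:", plans[best])
```

In this sketch the estimate for each candidate is a plain Monte Carlo average, so its accuracy improves with `n_rollouts`; the paper's finite-sample guarantee concerns how close the selected candidate's robust success is to that of the best candidate in the set.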
Submission Number: 13