Keywords: Embodied agents, Modular reinforcement learning, Temporal logic task planning, Robust policy selection, Counterfactual multi-hypothesis rollouts
TL;DR: CRUX is a modular embodied-agent pipeline that uses multi-hypothesis goal interpretation and counterfactual rollouts to choose robust plans, improving success under noise while matching the baseline's success in clean environments.
Abstract: Embodied environments are a standard way to probe the reasoning abilities of large
language models (LLMs) and other foundation models. Most systems follow the
same pattern: a language front-end interprets the instruction, a subgoal module
decomposes it, a planner proposes an action sequence, and a world model predicts
the consequences. Yet performance is often reported as a single task-level success
rate, which hides where the agent fails and how brittle it is under noise.
We take a modular view. We write down a formal model of a four-stage embodied
pipeline (goal interpretation, subgoal decomposition, planning, and world
modeling), and show that, under a mild monotonicity assumption, the overall task failure
probability is bounded by the sum of the error rates of the individual modules.
We refine this picture for tree-structured subgoal decompositions and show how
decomposition errors scale with the number of leaves.
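As an illustration of the modular bound described above, it can be written as a union bound over the four stages. The symbols below are assumptions introduced here for illustration (the abstract does not fix notation): $\varepsilon_{G}$, $\varepsilon_{S}$, $\varepsilon_{P}$, and $\varepsilon_{W}$ stand for the error rates of goal interpretation, subgoal decomposition, planning, and world modeling, and the numbers in the example are hypothetical.

```latex
% Sketch of the modular failure bound (notation assumed, not taken from the paper).
\[
  \Pr[\text{task failure}]
  \;\le\;
  \varepsilon_{G} + \varepsilon_{S} + \varepsilon_{P} + \varepsilon_{W}.
\]
% Hypothetical example: module error rates of 0.02, 0.03, 0.05, and 0.04 give
\[
  \Pr[\text{task failure}] \;\le\; 0.02 + 0.03 + 0.05 + 0.04 = 0.14 .
\]
```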
On top of this framework we build CRUX, a variant that treats interpretation and
decomposition as multi-hypothesis problems. Rather than committing to a single
interpretation of the language instruction, CRUX maintains a small set of candidate
formal tasks and decompositions, plans for each, and uses a world model to run
counterfactual rollouts under execution noise. The candidate with the highest
estimated robust success is then selected and executed. We prove a finite-sample
guarantee for this selection rule: with a reasonable number of rollouts, the robust
performance of the chosen candidate is close to that of the best candidate in the set.
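The selection rule described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the `rollout` callback signature, and the Bernoulli stand-in for a world-model rollout are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

C = TypeVar("C")  # a candidate (interpretation + decomposition + plan); hypothetical type


def select_robust_candidate(
    candidates: Sequence[C],
    rollout: Callable[[C, random.Random], bool],
    n_rollouts: int = 200,
    seed: int = 0,
) -> Tuple[int, List[float]]:
    """Return the index of the candidate with the highest empirical success
    rate over n_rollouts noisy rollouts, plus all per-candidate estimates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if n_rollouts <= 0:
        raise ValueError("n_rollouts must be positive")
    rng = random.Random(seed)
    estimates: List[float] = []
    for cand in candidates:
        # Each rollout simulates executing the candidate's plan under noise
        # and reports whether the task succeeded.
        successes = sum(1 for _ in range(n_rollouts) if rollout(cand, rng))
        estimates.append(successes / n_rollouts)
    best = max(range(len(candidates)), key=lambda i: estimates[i])
    return best, estimates


def bernoulli_rollout(p_success: float, rng: random.Random) -> bool:
    """Stand-in for a world-model rollout (assumption): the candidate is
    represented only by its true success probability under noise."""
    return rng.random() < p_success


if __name__ == "__main__":
    # Three hypothetical candidates with different robust success rates.
    best, estimates = select_robust_candidate(
        [0.55, 0.80, 0.70], bernoulli_rollout, n_rollouts=500, seed=42
    )
    print("chosen candidate:", best)
    print("estimated robust success:", estimates)
```

In this sketch, increasing `n_rollouts` tightens each empirical estimate around the candidate's true success rate, which is the intuition behind a finite-sample guarantee of the kind stated above.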
To ground the theory, we implement a compact CRUX prototype in a stochastic
gridworld. In this setting, both CRUX and a single-interpretation baseline achieve
perfect success in a clean environment, but under action noise and random start
perturbations CRUX consistently attains higher robust success. In a moderate noise
regime, the 95% confidence intervals for robust success do not overlap between
CRUX and the baseline.
Submission Number: 13