Tree-guided Diffusion Planner

Anonymous authors


Tree-guided Diffusion Planner (TDP) is a flexible training-free test-time planning framework that balances exploration and exploitation through structured trajectory generation. It addresses the limitations of gradient-based guidance by exploring diverse trajectory regions and harnessing gradient information across the expanded solution space.


(1) Parent Branching: diverse parent trajectories are produced via fixed-potential particle guidance [1] to encourage broad exploration.

(2) Sub-Tree Expansion: sub-trajectories are locally refined through fast conditional denoising guided by task objectives.

TDP consistently outperforms state-of-the-art planning approaches across a wide range of guidance functions, including non-convex objectives, non-differentiable constraints, and multi-reward structures.


Method

Parent Branching

Unlike conventional gradient-based guidance methods that pull samples toward high-reward regions, particle guidance introduces repulsive interactions by computing pairwise distances between trajectory samples (specifically over the control states), pushing them apart in the state space. A single denoising step is defined as:

$$ \left[\boldsymbol{\mu}^{i}_{\text{control}},\; \boldsymbol{\mu}^{i}_{\text{obs}}\right] \leftarrow \boldsymbol{\mu}_{\theta}(\boldsymbol{\tau}^{i}) $$

$$ \boldsymbol{\mu}^{i}_{\text{control}} \leftarrow \boldsymbol{\mu}^{i}_{\text{control}} + \alpha_p \Sigma^i \nabla \Phi(\boldsymbol{\mu}^{i}_{\text{control}}), \qquad \boldsymbol{\mu}^{i}_{\text{obs}} \leftarrow \boldsymbol{\mu}^{i}_{\text{obs}} + \alpha_g \Sigma^i \nabla \mathcal{J}(\boldsymbol{\mu}^{i}_{\text{obs}}) $$

$$ \boldsymbol{\mu}^{i} \leftarrow \left[\boldsymbol{\mu}^{i}_{\text{control}},\; \boldsymbol{\mu}^{i}_{\text{obs}}\right] $$

$$ \boldsymbol{\tau}^{i-1} \sim \mathcal{N}(\boldsymbol{\mu}^{i}, \Sigma^i) $$

where $\boldsymbol{\mu}^i_{\text{control}}$ and $\boldsymbol{\mu}^i_{\text{obs}}$ denote the control and observation components of the predicted mean of the denoising trajectory at timestep $i$, and $(\alpha_p, \alpha_g)$ are the particle-guidance and gradient-guidance strengths, respectively.
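The step above can be sketched in NumPy. This is a minimal illustration, not the released implementation: it assumes a fixed repulsive RBF potential for $\Phi$ (one common choice in particle guidance), a scalar variance in place of $\Sigma^i$, and a caller-supplied objective gradient $\nabla\mathcal{J}$.

```python
import numpy as np

def repulsive_grad(x, bandwidth=1.0):
    """Per-particle gradient of a fixed repulsive RBF potential Phi.
    x: (n_particles, dim) flattened control states of each trajectory."""
    diffs = x[:, None, :] - x[None, :, :]              # (n, n, dim)
    sq_dists = (diffs ** 2).sum(axis=-1)               # (n, n)
    kernel = np.exp(-sq_dists / (2 * bandwidth ** 2))  # pairwise RBF kernel
    # d/dx_i of -sum_j k(x_i, x_j) = sum_j k_ij (x_i - x_j) / h^2,
    # which points away from nearby particles (repulsion).
    return (kernel[..., None] * diffs).sum(axis=1) / bandwidth ** 2

def guided_denoise_step(mu_control, mu_obs, sigma, alpha_p, alpha_g, grad_J, rng):
    """One denoising step: particle guidance on the control component,
    gradient guidance on the observation component, then Gaussian sampling
    (sigma stands in for a scalar Sigma^i here)."""
    mu_control = mu_control + alpha_p * sigma * repulsive_grad(mu_control)
    mu_obs = mu_obs + alpha_g * sigma * grad_J(mu_obs)
    mu = np.concatenate([mu_control, mu_obs], axis=-1)
    return mu + np.sqrt(sigma) * rng.standard_normal(mu.shape)
```

Because the repulsive term enters with a positive sign, each particle is pushed away from its neighbors, which is what spreads the parent trajectories across distinct regions of the state space.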

Sub-Tree Expansion

For each parent trajectory, a random branch site is selected, and a child trajectory is generated by denoising a partially noised copy of the parent, refining it with gradient-guidance signals from the task objective.
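A rough sketch of this expansion step, under stated assumptions: `denoise_step` and `noise_schedule` stand in for the trained diffusion model's denoiser and its per-step noise scales, the branch site is interpreted as a randomly drawn noise level, and `grad_J` is the task-objective gradient. None of these interfaces are taken from the paper's code.

```python
import numpy as np

def subtree_expand(parent, denoise_step, noise_schedule, alpha_g, grad_J, rng):
    """Generate one child trajectory from a parent (illustrative sketch).
    The parent is partially re-noised to a random level i, then denoised
    back to a clean trajectory with gradient guidance at every step."""
    T = len(noise_schedule)
    i = int(rng.integers(1, T))                  # random branch site (noise level)
    child = parent + noise_schedule[i] * rng.standard_normal(parent.shape)
    for t in range(i, 0, -1):                    # fast conditional denoising
        mu = denoise_step(child, t)              # model's predicted mean
        mu = mu + alpha_g * grad_J(mu)           # guidance from task objective J
        child = mu + noise_schedule[t - 1] * rng.standard_normal(mu.shape)
    return child                                 # noise_schedule[0] == 0 -> clean
```

Because the child starts from an intermediate noise level rather than pure noise, only a fraction of the denoising steps are re-run, which keeps sub-tree expansion cheap relative to sampling a fresh trajectory.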

The full TDP algorithm is provided in the Full Algorithm section below.

Maze2D Gold-picking

The Maze2D gold-picking task is a planning problem with a test-time non-differentiable constraint: the agent must generate a feasible trajectory that satisfies an initial state, a final goal state, and an intermediate goal state (the gold position).

Two Gold-picking tasks in Maze2D-Large [3].

Gradient-based guidance typically requires selecting a guidance strength $\alpha$ that balances adherence to the guide signal against trajectory fidelity. However, $\alpha$ is highly task-dependent, and exhaustively tuning it across tasks introduces significant evaluation overhead. On the Maze2D gold-picking task, the MCSS (Monte-Carlo Sampling with Selection) baseline exhibits $\alpha$-dependent performance, whereas TDP remains robust across varying values of the guidance strength $\alpha$. $\alpha_0$ denotes the guidance strength used in the main paper.


Pick-and-Where-to-Place ($\texttt{PnWP}$)

$\texttt{PnWP}$ with Kuka robot arm [4].

We introduce a non-convex exploration task in a robot arm manipulation environment. The agent must infer a suitable placement location from the reward distribution and plan pick-and-place actions. Since the reward landscape has a wide peak at the local optimum $x^*_{local}$ and a narrow peak at the global optimum $x^*_{global}$, agents easily get stuck in the local optimum unless the planner sufficiently explores the trajectory space. Mono-level guided sampling methods (i.e., TAT [2], MCSS) tend to converge to the local optimum, often stacking all blocks at $x^*_{local}$.
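To make the wide-vs-narrow peak structure concrete, here is a hypothetical 1-D reward landscape in the same spirit; the peak locations, heights, and widths are illustrative choices, not the task's actual parameters.

```python
import numpy as np

def placement_reward(x, x_local=-1.0, x_global=2.0):
    """Illustrative non-convex placement reward: a wide peak at the local
    optimum and a narrower but taller peak at the global optimum."""
    wide = 0.8 * np.exp(-((x - x_local) ** 2) / (2 * 0.5 ** 2))    # broad basin
    narrow = 1.0 * np.exp(-((x - x_global) ** 2) / (2 * 0.05 ** 2))  # sharp peak
    return wide + narrow
```

A greedy guided sampler is far more likely to land in the wide basin around $x^*_{local}$, since the narrow global peak occupies a tiny fraction of the search space; this is exactly the failure mode that bi-level exploration is meant to avoid.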

TDP is a bi-level search framework.

TAT
(Highest-weighted trajectory)
MCSS
(Highest-scoring trajectory)
TDP

AntMaze Multi-goal Exploration

Multi-goal Exploration in AntMaze-Large [3].

We introduce a multi-reward exploration task in AntMaze environment. The diffusion planner predicts the next 64 steps (highlighted in bright on the map) using a combined Gaussian reward signal from multiple goals. Goals must be visited in priority order, with higher-priority goals emitting stronger, narrower Gaussians.
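A combined reward of this shape can be sketched as follows. The strength/width schedule (strength $1/\mathrm{rank}$, width proportional to rank) is an illustrative assumption; the task's actual Gaussian parameters are not specified here.

```python
import numpy as np

def multigoal_reward(pos, goals, priorities):
    """Illustrative combined Gaussian reward over multiple goals.
    Higher-priority goals (lower rank) emit stronger, narrower Gaussians."""
    reward = 0.0
    for goal, rank in zip(goals, priorities):
        strength = 1.0 / rank          # highest-priority goal is strongest
        width = 0.5 * rank             # ...and narrowest
        sq_dist = np.sum((np.asarray(pos) - np.asarray(goal)) ** 2)
        reward += strength * np.exp(-sq_dist / (2 * width ** 2))
    return reward
```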

For example, as illustrated in the figure above, the first goal the agent visits is $g_2$ at $t = t_3$. If the agent subsequently visits $g_1$, $g_4$, and $g_3$ after $t=t_3$, it successfully reaches all four goals ($g_2 \rightarrow g_1 \rightarrow g_4 \rightarrow g_3$). However, some of the goal priorities are violated. Specifically, the orderings $g_2 \rightarrow g_4$, $g_2 \rightarrow g_3$, $g_1 \rightarrow g_4$, and $g_1 \rightarrow g_3$ are correct, while $g_2 \rightarrow g_1$ and $g_4 \rightarrow g_3$ violate the intended priority. In this case, while the agent achieves a goal completion score of 4/4, its priority sequence match accuracy is only 4/6. The agent can achieve the maximum accuracy of 6/6 only by visiting all goals in the correct prioritized order—i.e., $g_1 \rightarrow g_2 \rightarrow g_3 \rightarrow g_4$.
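The pairwise accuracy used in this example can be computed directly: count, over all ordered pairs of visited goals, how many respect the priority ranking. This is a sketch of the metric as described above, not the paper's implementation.

```python
from itertools import combinations

def goal_metrics(visit_order, priority):
    """Goal-completion count and priority sequence match accuracy.
    visit_order: goals in the order the agent reached them.
    priority: goal -> rank, where a lower rank means higher priority."""
    pairs = list(combinations(visit_order, 2))   # every earlier/later pair
    correct = sum(priority[a] < priority[b] for a, b in pairs)
    return len(visit_order), correct, len(pairs)
```

Running it on the trajectory above ($g_2 \rightarrow g_1 \rightarrow g_4 \rightarrow g_3$) reproduces the 4/4 completion and 4/6 accuracy, while the fully ordered visit $g_1 \rightarrow g_2 \rightarrow g_3 \rightarrow g_4$ scores 6/6.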

TDP achieves more goals, with higher sequence accuracy.

MCSS
(X)
MCSS
($g_2 \rightarrow g_3 \rightarrow $ X)
TDP
($g_1 \rightarrow g_2 \rightarrow g_3 \rightarrow g_4$)

TDP completes tasks in fewer timesteps.

MCSS
($g_1 \rightarrow g_4 \rightarrow g_3 \rightarrow g_2$, slow)
TDP
($g_1 \rightarrow g_4 \rightarrow g_3 \rightarrow g_2$, fast)

In these videos, COMPLETE indicates that the agent has visited all four goals, and SUCCESS indicates that it has visited all four goals in the correct priority order.

Full Algorithm

Reference

[1]: Gabriele Corso, Yilun Xu, Valentin De Bortoli, Regina Barzilay, and Tommi S. Jaakkola. Particle guidance: non-i.i.d. diverse sampling with diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KqbCvIFBY7.

[2]: Lang Feng, Pengjie Gu, Bo An, and Gang Pan. Resisting stochastic risks in diffusion planners with the trajectory aggregation tree, 2024. URL https://arxiv.org/abs/2405.17879.

[3]: Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2021. URL https://arxiv.org/abs/2004.07219.

[4]: Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning, 2020. URL https://arxiv.org/abs/1802.08705.