Unlike conventional gradient-based guidance methods that pull samples toward high-reward regions, particle guidance [1] introduces repulsive interactions by computing pairwise distances between trajectory samples, specifically over their control states, pushing the samples apart in state space. A single denoising step is defined as:
$$ \left[\boldsymbol{\mu}^{i}_{\text{control}},\; \boldsymbol{\mu}^{i}_{\text{obs}}\right] \leftarrow \boldsymbol{\mu}_{\theta}(\boldsymbol{\tau}^{i}) $$ $$ \boldsymbol{\mu}^{i}_{\text{control}} \leftarrow \boldsymbol{\mu}^{i}_{\text{control}} + \alpha_p \Sigma^i \nabla \Phi(\boldsymbol{\mu}^{i}_{\text{control}}), \quad \boldsymbol{\mu}^{i}_{\text{obs}} \leftarrow \boldsymbol{\mu}^{i}_{\text{obs}} + \alpha_g \Sigma^i \nabla \mathcal{J}(\boldsymbol{\mu}^{i}_{\text{obs}}) $$ $$ \boldsymbol{\mu}^{i} \leftarrow \left[\boldsymbol{\mu}^{i}_{\text{control}},\; \boldsymbol{\mu}^{i}_{\text{obs}}\right] $$ $$ \boldsymbol{\tau}^{i-1} \sim \mathcal{N}(\boldsymbol{\mu}^{i}, \Sigma^i) $$ where $\boldsymbol{\mu}^i_{\text{control}}$ and $\boldsymbol{\mu}^i_{\text{obs}}$ denote the control and observation components of the predicted mean of the denoising trajectory at timestep $i$, and $(\alpha_p, \alpha_g)$ are the guidance strengths for particle guidance and gradient guidance, respectively. For each parent trajectory, a random branch site is selected, and a child trajectory is generated by denoising from a partially noised version of the parent, refining the parent trajectories with gradient-guidance signals.
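The guided denoising step above can be sketched in code. This is a minimal illustration, not the paper's implementation: the covariance $\Sigma^i$ is reduced to a scalar variance `sigma`, the repulsive potential $\Phi$ is taken to be the sum of pairwise squared distances between particles' flattened control states (so its gradient has a closed form), and all function and argument names are ours.

```python
import numpy as np

def guided_denoise_step(mu_theta, traj, sigma, grad_J,
                        alpha_p=0.1, alpha_g=0.1, ctrl_dim=2, rng=None):
    """One guided denoising step (illustrative sketch).
    mu_theta: maps particle trajectories (N, H, D) -> predicted means (N, H, D)
    grad_J:   gradient of the reward guide w.r.t. the observation states
    sigma:    scalar stand-in for the step covariance Sigma^i (assumption)
    """
    rng = rng or np.random.default_rng(0)
    mu = mu_theta(traj)
    mu_ctrl, mu_obs = mu[..., :ctrl_dim], mu[..., ctrl_dim:]

    # Repulsive particle guidance: Phi = sum over particle pairs of the
    # squared distance between flattened control states. Its gradient,
    # 2 * (N * x_i - sum_j x_j), pushes the N particle means apart.
    N = mu_ctrl.shape[0]
    flat = mu_ctrl.reshape(N, -1)
    grad_phi = 2.0 * (N * flat - flat.sum(axis=0, keepdims=True))
    mu_ctrl = mu_ctrl + alpha_p * sigma * grad_phi.reshape(mu_ctrl.shape)

    # Gradient guidance toward high reward on the observation states.
    mu_obs = mu_obs + alpha_g * sigma * grad_J(mu_obs)

    # Sample tau^{i-1} ~ N(mu^i, sigma).
    mu_new = np.concatenate([mu_ctrl, mu_obs], axis=-1)
    return mu_new + np.sqrt(sigma) * rng.standard_normal(mu_new.shape)
```

With `alpha_p > 0`, particles whose control states coincide receive opposite-signed updates, increasing their pairwise separation before the next denoising step.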
The full TDP procedure is provided in the accompanying Algorithm.
The Maze2D gold-picking task is a planning problem with a test-time non-differentiable constraint, where the agent must generate a feasible trajectory that satisfies an initial state, a final goal state, and an intermediate goal state (the gold position).
Gradient-based guidance typically requires selecting a guidance strength $\alpha$ to balance adherence to the guide signal and trajectory fidelity.
However, $\alpha$ is highly task-dependent, and exhaustive tuning across tasks introduces significant overhead during evaluation.
On the Maze2D gold-picking task, the MCSS (Monte-Carlo Sampling with Selection) baseline exhibits $\alpha$-dependent performance, whereas TDP remains robust across varying values of the guidance strength $\alpha$.
$\alpha_0$ denotes the guidance strength used in the main paper.
We introduce a non-convex exploration task in a robot arm manipulation environment. The agent must infer a suitable placement location based on the reward distribution and plan pick-and-place actions. Since $x^*_{local}$ has a wide peak and $x^*_{global}$ has a narrow peak, agents easily get stuck in the local optimum unless the planner sufficiently explores the trajectory space. Mono-level guided sampling methods (i.e., TAT [2], MCSS) tend to converge to local optima, often stacking all blocks at $x^*_{local}$.
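The local-optimum trap can be made concrete with a toy 1-D reward landscape. All positions, widths, and heights below are illustrative assumptions, not the task's actual parameters: the wide basin around `x_local` dominates the sampling volume, while the higher global peak at `x_global` covers only a sliver of the space.

```python
import numpy as np

def placement_reward(x, x_local=-0.5, x_global=0.7):
    """Toy non-convex placement reward (parameters are illustrative)."""
    wide_local = 1.0 * np.exp(-((x - x_local) ** 2) / (2 * 0.3 ** 2))
    narrow_global = 1.5 * np.exp(-((x - x_global) ** 2) / (2 * 0.05 ** 2))
    return wide_local + narrow_global

xs = np.linspace(-1.5, 1.5, 2001)
r = placement_reward(xs)
# A planner that samples a handful of candidates will usually land in the
# wide local basin; the narrow global peak occupies little of the space,
# so reaching it requires deliberate exploration of the trajectory space.
```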
We introduce a multi-reward exploration task in the AntMaze environment. The diffusion planner predicts the next 64 steps (highlighted brightly on the map) using a combined Gaussian reward signal from multiple goals. Goals must be visited in priority order, with higher-priority goals emitting stronger, narrower Gaussians.
For example, as illustrated in the figure above, the first goal the agent visits is $g_2$ at $t = t_3$. If the agent subsequently visits $g_1$, $g_4$, and $g_3$ after $t=t_3$, it successfully reaches all four goals ($g_2 \rightarrow g_1 \rightarrow g_4 \rightarrow g_3$). However, some of the goal priorities are violated. Specifically, the orderings $g_2 \rightarrow g_4$, $g_2 \rightarrow g_3$, $g_1 \rightarrow g_4$, and $g_1 \rightarrow g_3$ are correct, while $g_2 \rightarrow g_1$ and $g_4 \rightarrow g_3$ violate the intended priority. In this case, while the agent achieves a goal completion score of 4/4, its priority sequence match accuracy is only 4/6. The agent can achieve the maximum accuracy of 6/6 only by visiting all goals in the correct prioritized order—i.e., $g_1 \rightarrow g_2 \rightarrow g_3 \rightarrow g_4$.
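The priority sequence match accuracy described above counts, over all goal pairs, how many are visited in the intended priority order. A minimal sketch of this metric (the function name is ours) reproduces the worked example:

```python
from itertools import combinations

def priority_match_accuracy(visit_order, priority_order):
    """Fraction of goal pairs visited in the intended priority order.
    visit_order:    goals in the order the agent reached them
    priority_order: goals from highest to lowest priority
    Returns (correctly ordered pairs, total pairs)."""
    pos = {g: i for i, g in enumerate(visit_order)}
    pairs = list(combinations(priority_order, 2))  # (higher, lower) priority
    correct = sum(pos[hi] < pos[lo] for hi, lo in pairs)
    return correct, len(pairs)

# Worked example from the text: visiting g2 -> g1 -> g4 -> g3 yields 4/6,
# since only the pairs g2 -> g1 and g4 -> g3 violate the priority.
correct, total = priority_match_accuracy(
    ["g2", "g1", "g4", "g3"], ["g1", "g2", "g3", "g4"])  # -> (4, 6)
```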
In these videos, COMPLETE indicates that the agent has successfully visited all four goals, and SUCCESS indicates that the agent has successfully visited all four goals in the correct order.
[1]: Gabriele Corso, Yilun Xu, Valentin De Bortoli, Regina Barzilay, and Tommi S. Jaakkola. Particle guidance: non-i.i.d. diverse sampling with diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KqbCvIFBY7.
[2]: Lang Feng, Pengjie Gu, Bo An, and Gang Pan. Resisting stochastic risks in diffusion planners with the trajectory aggregation tree, 2024. URL https://arxiv.org/abs/2405.17879.
[3]: Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2021. URL https://arxiv.org/abs/2004.07219.
[4]: Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning, 2020. URL https://arxiv.org/abs/1802.08705.