Workshop Statement: In this work, we address the problem of instruction-following navigation in dynamic environments, where instructions specify how a robot should interact with entities in its surroundings—such as moving humans, static obstacles, or specific terrain regions. To tackle this problem, we introduce ComposableNav, a composable, diffusion-based motion planner. The key insight behind our approach is that complex instructions can be decomposed into individual specifications, each corresponding to a distinct motion primitive. We leverage diffusion models to learn these primitives and compose them at deployment to accommodate different combinations of instruction specifications. Solving this problem would allow end users (humans or AI agents) to customize robotic behaviors beyond their default settings, in ways that align with user preferences and nuanced social interactions.
Keywords: diffusion-based motion planner, diffusion composition, dynamic navigation, instruction-following navigation
TL;DR: We propose a composable, diffusion-based motion planner to address instruction-following navigation in dynamic environments.
Abstract: We study how robots navigate dynamic environments while following instructions. Unlike prior work where instructions only specify navigation goals in static environments, our work focuses on instructions that specify robot behaviors (e.g., “yield to a pedestrian”). This problem poses two key challenges: (1) the robot must learn to satisfy an exponential number of specification combinations across different instructions, and (2) the robot must reason about multiple specifications concurrently, rather than processing them sequentially, when operating in dynamic environments. To address these challenges, we propose ComposableNav, based on the insight that following an instruction amounts to independently satisfying its constituent specifications, each satisfied by a different motion primitive. ComposableNav uses diffusion models to individually learn these primitives and composes them in parallel at deployment to generate an instruction-following trajectory. For example, “overtake the pedestrian in front and stay on the sidewalk” is achieved by composing the primitives “overtake the pedestrian” and “stay on the sidewalk.” In addition, we introduce a two-stage training procedure consisting of supervised pre-training followed by reinforcement learning fine-tuning, enabling effective learning of each motion primitive without requiring primitive-specific demonstrations. Through both simulation and real-world experiments, we show that ComposableNav enables robots to follow a broad range of instructions and significantly outperforms both non-compositional VLM-based policies and baselines that compose costmaps.
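To make the parallel-composition idea concrete, below is a minimal illustrative sketch (not the authors' released code) of one common way to compose independently trained diffusion models at sampling time: combining their noise predictions inside a standard DDPM reverse-diffusion loop. All names (`primitives`, `compose_and_sample`, the schedule constants) are hypothetical placeholders, and the schedule and weighting are assumptions for illustration only.

```python
# Hypothetical sketch: composing independently trained diffusion "primitives"
# by combining their noise predictions during reverse diffusion.
import torch

@torch.no_grad()
def compose_and_sample(primitives, traj_shape, num_steps=50):
    """Reverse-diffuse a trajectory under a combination of primitive scores.

    primitives: list of denoisers, each mapping (noisy_traj, t) -> predicted noise.
    traj_shape: e.g. (horizon, 2) for a 2-D waypoint trajectory.
    """
    x = torch.randn(traj_shape)                    # start from pure noise
    betas = torch.linspace(1e-4, 2e-2, num_steps)  # simple linear schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    for t in reversed(range(num_steps)):
        # Composition step: every primitive scores the same noisy trajectory,
        # and the predictions are combined (here, a simple mean).
        eps = torch.stack([model(x, t) for model in primitives]).mean(dim=0)

        # Standard DDPM posterior-mean update using the composed noise estimate.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```

Under this sketch, the abstract's example instruction would be handled by passing two separately trained denoisers, e.g. `compose_and_sample([overtake_model, sidewalk_model], traj_shape=(64, 2))`, so that each sampling step jointly respects both specifications rather than satisfying them one after the other.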
Submission Number: 29