Learning Spatial-Aware Manipulation Ordering

Yuxiang Yan; Zhiyuan Zhou; Xin Gao; Guanghao Li; Shenglin Li; Jiaqi Chen; Qunyan Pu; Jian Pu

Published: 18 Sept 2025, Last Modified: 29 Oct 2025

Venue: NeurIPS 2025 poster

License: CC BY 4.0

Keywords: object manipulation, cluttered environment

TL;DR: OrderMind is a spatial-aware framework for manipulation ordering in cluttered environments. It learns object manipulation orderings by encoding spatial relationships. It outperforms existing methods in both simulated and real-world tasks.

Abstract: Manipulation in cluttered environments is challenging due to spatial dependencies among objects, where an improper manipulation order can cause collisions or blocked access. Existing approaches often overlook these spatial relationships, limiting their flexibility and scalability. To address these limitations, we propose OrderMind, a unified spatial-aware manipulation ordering framework that directly learns object manipulation priorities based on spatial context. Our architecture integrates a spatial context encoder with a temporal priority structuring module. We construct a spatial graph using k-Nearest Neighbors to aggregate geometric information from the local layout and encode both object-object and object-manipulator interactions to support accurate manipulation ordering in real time. To generate physically and semantically plausible supervision signals, we introduce a spatial prior labeling method that guides a vision-language model to produce reasonable manipulation orders for distillation. We evaluate OrderMind on our Manipulation Ordering Benchmark, comprising 163,222 samples of varying difficulty. Extensive experiments in both simulation and real-world environments demonstrate that our method significantly outperforms prior approaches in effectiveness and efficiency, enabling robust manipulation in cluttered scenes.
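
The k-Nearest Neighbors spatial graph mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `knn_graph`, the use of 3D object centers as node positions, the Euclidean metric, and the sample coordinates are all assumptions made for illustration, and the abstract does not specify how neighbor features are aggregated afterwards.

```python
import math

def knn_graph(centers, k):
    """Map each object index to the indices of its k nearest other objects.

    Hypothetical helper: uses Euclidean distance between object centers.
    """
    n = len(centers)
    graph = {}
    for i in range(n):
        # Distance from object i to every other object j (self excluded).
        dists = [(math.dist(centers[i], centers[j]), j) for j in range(n) if j != i]
        dists.sort()  # nearest first; ties fall back to the lower index
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Made-up object centers (in metres) for a small cluttered scene.
centers = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.3, 0.0), (1.0, 1.0, 0.0)]
graph = knn_graph(centers, k=2)
```

Each entry of `graph` lists the neighbors whose geometry a spatial encoder could aggregate for that object. The brute-force search here is quadratic in the number of objects, which is usually acceptable for the object counts in a single scene.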

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 4219
