Keywords: Data enhancement, Visual clutter, Distractors, Affordance prediction, Robotic manipulation
TL;DR: Robotic data enhancement via in-context visual scene editing
Abstract: Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors and clutter can significantly degrade performance. In this work, we highlight the challenges that visual clutter poses to robotic manipulation and propose an effective and scalable in-context visual scene editing (NICE) strategy based on real-world images. Our method synthesizes new variations of existing robot demonstration datasets by programmatically modifying non-target objects directly within the real scenes. This approach diversifies environmental conditions without requiring additional action generation, synthetic rendering, or simulator access. Using real-world scenes, we showcase the capability of our framework to perform realistic object replacement, restyling, and removal. We generate new data with NICE and finetune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our experiments show that training on data produced by our editing framework yields more than a 20% improvement in both affordance prediction accuracy and manipulation success rate.
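The abstract describes an augmentation loop in which only non-target objects are edited while the demonstrated actions are reused unchanged. The following is a minimal, hypothetical sketch of such a loop; the names (`Demo`, `EditOp`, `edit_frame`, `augment`) and the structure are assumptions for illustration, not the paper's actual implementation or API.

```python
# Hypothetical sketch of a NICE-style scene-editing augmentation loop.
# All names here are illustrative; the actual editing backend (e.g. a
# generative inpainting/replacement model) is left as a placeholder.
import random
from dataclasses import dataclass
from enum import Enum, auto


class EditOp(Enum):
    REPLACE = auto()   # swap a non-target object for a different object
    RESTYLE = auto()   # change the appearance (color/texture) of a non-target object
    REMOVE = auto()    # inpaint the non-target object away


@dataclass
class Demo:
    frames: list            # RGB observations from the original demonstration
    actions: list           # robot actions, reused verbatim in every variant
    target_mask: object     # segmentation of the manipulation target (never edited)
    distractor_masks: list  # segmentations of non-target (distractor) objects


def edit_frame(frame, mask, op: EditOp):
    """Placeholder for a generative image editor applied within the masked region."""
    raise NotImplementedError("plug in a scene-editing backend here")


def augment(demo: Demo, num_variants: int = 4) -> list:
    """Create visual variants of a demo by editing only non-target objects.

    Actions are copied unchanged: the target object and the robot motion are
    untouched, so no new action generation, rendering, or simulator access
    is required.
    """
    variants = []
    for _ in range(num_variants):
        new_frames = []
        for frame in demo.frames:
            edited = frame
            for mask in demo.distractor_masks:
                op = random.choice(list(EditOp))
                edited = edit_frame(edited, mask, op)
            new_frames.append(edited)
        variants.append(Demo(new_frames, demo.actions,
                             demo.target_mask, demo.distractor_masks))
    return variants
```

Under these assumptions, the edited frames paired with the original actions form new training examples for finetuning the VLM and VLA models mentioned above.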
Lightning Talk Video: mp4
Optional Poster Upload: pdf
Submission Number: 18