Systematic Evaluation of Modular Robotic Manipulation Policies via Structured Condition Space: A Study on Precision Pick-and-Place Tasks

Published: 21 May 2026, Last Modified: 21 May 2026, Venue: ICRA 2026, Readers: Everyone, License: CC BY 4.0
Keywords: Performance Evaluation and Benchmarking, Precision Pick and Place, Perception for Grasping and Manipulation
TL;DR: We propose a systematic evaluation framework that constructs a structured condition space with LLM assistance, enabling fine-grained characterization of manipulation robustness boundaries under real-world variations.
Abstract: Foundation models have demonstrated strong potential for robotic manipulation, promising adaptability across diverse tasks and environments. Despite favorable benchmark performance, these systems often exhibit degraded or unstable behavior under variations in object geometry, spatial configuration, sensing conditions, and workspace constraints, challenges that are amplified in real-world deployment. Existing evaluation efforts broaden coverage across multiple perturbation axes but primarily measure condition-level sensitivity, which offers limited insight into the precise configurations under which policies become unreliable. We propose a systematic evaluation framework that explicitly constructs a structured condition space with LLM assistance. Leveraging the semantic and commonsense priors encoded in large language models, we decompose high-level evaluation factors into structured, parameterized subspaces, enabling scalable exploration of environmental variations. This design shifts evaluation from coarse condition-level analysis to structured reliability boundary identification. We further introduce a modular architecture that compiles robotic manipulation policies within this unified framework and supports execution analysis across diverse conditions. Experimental results on precision pick-and-place tasks demonstrate that the framework enables fine-grained characterization of performance degradation and failure patterns, providing actionable insights for robustness assessment and real-world deployment.
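As a rough illustration of what a structured, parameterized condition space might look like, the sketch below enumerates a small grid of conditions and flags the factor values whose success rate falls below a threshold. The factor names, parameter values, threshold, and helper functions are all hypothetical; the abstract does not specify the framework's actual subspaces, its LLM-assisted decomposition step, or its boundary criterion.

```python
from itertools import product

# Hypothetical factors and values; the paper's actual subspaces are not given.
CONDITION_SPACE = {
    "object_width_mm": [20, 40, 60],
    "lighting_lux": [100, 500],
}


def enumerate_conditions(space):
    """Yield one dict per point in the Cartesian product of all factor values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def success_rate_by_value(results, factor):
    """Return {factor value: success rate} from (condition, success) pairs."""
    totals = {}
    for condition, succeeded in results:
        wins, trials = totals.get(condition[factor], (0, 0))
        totals[condition[factor]] = (wins + int(succeeded), trials + 1)
    return {value: wins / trials for value, (wins, trials) in totals.items()}


def unreliable_values(results, factor, threshold=0.8):
    """List the values of one factor whose success rate is below the threshold."""
    rates = success_rate_by_value(results, factor)
    return sorted(value for value, rate in rates.items() if rate < threshold)


if __name__ == "__main__":
    conditions = list(enumerate_conditions(CONDITION_SPACE))
    # Stand-in outcome for demonstration only: a real run would execute the policy.
    results = [(c, c["object_width_mm"] >= 40) for c in conditions]
    print(unreliable_values(results, "object_width_mm"))
```

An exhaustive grid is used here only because it is the simplest way to show the idea; with many factors the product grows quickly, so a real evaluation would likely need sampling or adaptive refinement near the observed boundary.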
Submission Number: 30