A Survey of Robotic Learning for Perception and Manipulation: From Modular Pipelines to Robotic Foundation Models

TMLR Paper8816 Authors

08 May 2026 (modified: 29 May 2026) · Under review for TMLR · CC BY 4.0

Abstract: Over the past decade, robotic manipulation systems have undergone a fundamental paradigm shift: from carefully engineered hierarchical pipelines to data-driven foundation-model-based robotic policies. Following the 2015 DARPA Robotics Challenge, classical systems relied on decomposed perception-planning-control architectures with strong modeling assumptions and task-specific engineering. Since then, advances in machine learning, large-scale visual representation learning, and robot interaction data collection have enabled a progression toward imitation learning policies, end-to-end generative visuomotor policies, and, most recently, robotic foundation models capable of multi-task and cross-embodiment generalization. This survey provides a structured perspective on this evolution from the viewpoint of robotic perception and manipulation. We introduce a taxonomy of manipulation systems organized along architectural transitions (hierarchical pipelines, imitation-based policies, learning-based generative visuomotor policies, and robotic foundation models such as VLAs) and analyze each paradigm in terms of system design, data requirements, and embodied intelligence capabilities such as compositionality, generalization, and adaptability. Beyond model architectures, we examine the scaling of data that underpins recent progress, covering developments in large-scale visual and 3D datasets, in-the-wild robot interaction corpora, and emerging sensing modalities including tactile and force feedback. We further discuss emerging directions that integrate robotic foundation models with reinforcement learning and world models to enable online adaptation and long-horizon reasoning in physical environments.
We review current benchmarks and evaluation protocols, highlighting limitations in measuring generalization, safety, and data efficiency, and conclude by outlining open challenges toward general-purpose embodied agents, including interaction-centric scaling, safety and alignment in physical deployment, multimodal perception integration, and the fusion of cognitive abstraction with physical reasoning. By synthesizing architectural, data-centric, and systems-level trends, this survey aims to provide both a conceptual map of robotic learning’s recent trajectory and a forward-looking agenda for advancing robotic manipulation toward truly general embodied intelligence.

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Efstratios_Gavves1

Submission Number: 8816
