Keywords: robotic manipulation, action-label-free learning, generative visual foresight
TL;DR: We propose a closed-loop framework that integrates generative visual foresight with task-agnostic pose estimation.
Abstract: Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce GVF-TAPE, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single RGB side-view image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.
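The abstract describes a foresight-then-pose closed loop: predict future RGB-D frames, extract end-effector poses from them, execute, and replan from the new observation. The sketch below illustrates that control flow only; all class and method names (VideoForesightModel, PoseEstimator, run_closed_loop, camera, controller) are hypothetical placeholders, not the authors' implementation or API.

```python
import numpy as np


class VideoForesightModel:
    """Stand-in for the generative video model: predicts future RGB-D frames
    from the current RGB side-view image and a task description."""

    def predict_frames(self, rgb_image: np.ndarray, task: str, horizon: int = 8):
        h, w, _ = rgb_image.shape
        # Placeholder output: `horizon` RGB-D frames (4 channels: RGB + depth).
        return [np.zeros((h, w, 4), dtype=np.float32) for _ in range(horizon)]


class PoseEstimator:
    """Stand-in for the task-agnostic pose estimation model: extracts an
    end-effector pose (position, orientation, gripper state) per frame."""

    def estimate(self, rgbd_frame: np.ndarray) -> np.ndarray:
        # Placeholder pose: [x, y, z, qx, qy, qz, qw, gripper].
        return np.zeros(8, dtype=np.float32)


def run_closed_loop(camera, controller, task: str, max_iters: int = 50):
    """Iterate foresight -> pose estimation -> low-level control until done."""
    foresight = VideoForesightModel()
    pose_model = PoseEstimator()
    for _ in range(max_iters):
        rgb = camera.capture()                        # current RGB side-view image
        frames = foresight.predict_frames(rgb, task)  # visual plan as RGB-D frames
        poses = [pose_model.estimate(f) for f in frames]
        for pose in poses:                            # execute via low-level controller
            controller.move_to(pose[:7], gripper=pose[7])
        if controller.task_done():                    # otherwise replan from new view
            break
```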
Supplementary Material: zip
Spotlight: mp4
Submission Number: 498