TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0
Abstract: Existing embodied control research demonstrates remarkable performance improvements by scaling training data and model size. We instead explore inference-time strategies as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, have been widely adopted in the field of embodied control. However, the single-shot inference paradigm limits their performance. In this paper, we propose **TapSampling**, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE that represents actions in a low-dimensional latent space by mapping policy-generated initial actions into a compressed posterior distribution. Any number of latent samples can be drawn from this posterior and decoded into candidate actions that approximate the true action distribution. Second, we formulate action verification as task-progress outcome prediction, using the intrinsic sequential structure of robotic datasets to train a semantically grounded verifier for interpretable action selection. TapSampling is also policy-agnostic. Extensive experiments in both simulated and real-world environments demonstrate that our method substantially improves multiple generalist policies without further policy finetuning. Code and models are available at the **project page** (https://aipixel.github.io/TapSampling/).
Lay Summary: Robots are increasingly controlled by powerful generative AI policies, but their performance can still be unstable: in the same situation, a policy may produce an action that succeeds in one trial and fails in another. Most systems use only one generated action at each decision step, so they cannot benefit from other feasible actions the policy might have produced. We introduce TapSampling, a method that obtains multiple feasible candidate robot actions and then generates a steadier action that is more likely to improve task progress. Instead of repeatedly running the full robot policy, TapSampling efficiently creates candidate actions from a compact learned representation of the policy's initial output. It then evaluates each candidate by estimating what would happen after taking that action, such as whether it would bring the robot closer to grasping, lifting, or placing an object. TapSampling can be added to different robot policies without retraining them. By generating multiple candidate actions and evaluating their possible consequences, it helps robots obtain better actions and more consistently execute actions that support task completion. Experiments in simulation and the real world show that this improves the reliability of several general-purpose robot control policies.
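The sample-then-verify loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `sample_candidates`, `select_action`, `toy_encode`, `toy_decode`, and `toy_verifier` are hypothetical, the encoder and decoder are random linear maps standing in for a trained Action-VAE, the verifier is a toy distance score standing in for a learned task-progress predictor, and picking the highest-scoring candidate is an assumed selection rule.

```python
import numpy as np

def sample_candidates(initial_action, encode, decode, num_samples, rng):
    """Encode one policy action into a Gaussian posterior, then decode several latent draws."""
    mu, log_var = encode(initial_action)
    std = np.exp(0.5 * log_var)
    # Reparameterized draws: z = mu + std * eps, with eps ~ N(0, I)
    eps = rng.standard_normal((num_samples, mu.shape[0]))
    latents = mu + std * eps
    return np.stack([decode(z) for z in latents])

def select_action(candidates, observation, verifier):
    """Score every candidate and return the highest-scoring one (assumed rule)."""
    scores = np.array([verifier(observation, a) for a in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores

# Toy stand-ins (hypothetical): a real system would use trained networks here.
rng = np.random.default_rng(0)
ACTION_DIM, LATENT_DIM = 7, 3
W_enc = rng.standard_normal((LATENT_DIM, ACTION_DIM)) * 0.1
W_dec = rng.standard_normal((ACTION_DIM, LATENT_DIM)) * 0.1

def toy_encode(action):
    # Posterior mean from a linear map; fixed log-variance for simplicity.
    return W_enc @ action, np.full(LATENT_DIM, -2.0)

def toy_decode(latent):
    return W_dec @ latent

def toy_verifier(observation, action):
    # Placeholder "progress" score: closer to a target vector is better.
    return -float(np.linalg.norm(action - observation))

initial_action = rng.standard_normal(ACTION_DIM)  # single-shot policy output
target = np.zeros(ACTION_DIM)                     # stands in for the observation
candidates = sample_candidates(initial_action, toy_encode, toy_decode, 16, rng)
best_action, scores = select_action(candidates, target, toy_verifier)
```

The policy is queried only once in this sketch. The additional candidates come from cheap latent draws, which matches the summary's point about not repeatedly running the full robot policy.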
Link To Code: https://github.com/aipixel/TapSampling
Primary Area: Applications->Robotics
Keywords: Embodied AI, Inference-time Sampling, Generalist Policies
Submission Number: 991