Keywords: VLA, Diffusion, Manipulation, Cluttered Scene, Multi-task, Cross-embodiment
TL;DR: We propose GAM, a unified diffusion-based VLA model that generates and scores diverse manipulation actions across embodiments via language prompts, validated at >95% success over 10M real-world cycles.
Abstract: We present the Generalized Action Model (GAM), a production-grade foundation model that unifies robotic action generation across diverse tasks and embodiments through a vision-language-action (VLA) pipeline. GAM addresses two fundamental barriers in scaling robotic manipulation: the lack of a unified representation for diverse robot end-effectors and the prohibitive cost of acquiring high-quality interaction data at scale. Our approach introduces (1) a unified language-prompted policy and critic that generates and scores diverse manipulation actions---including suction grasps, pinch grasps, caging, and placements---from a single model, (2) a scalable offline data generation pipeline that recomputes dense action candidates and quality labels in simulation from real-world observations, and (3) an end-effector encoding that enables zero-shot transfer to unseen hardware. We validate GAM on a fleet of robotic work-cells, where it has executed over 10 million pick-and-place cycles with greater than $95\%$ pick and greater than $90\%$ place success rates. The same model generalizes to hybrid end-effectors with distinct grasping modes at greater than $90\%$ success.
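The abstract describes a generate-then-score pipeline: a language-prompted diffusion policy proposes candidate actions conditioned on the observation and an end-effector encoding, and a critic ranks them. The sketch below is a minimal illustration of that interface only; the class and function names (`EndEffectorEncoding`, `propose_actions`, `score_actions`) are hypothetical, and the diffusion sampler and critic are replaced by stubs rather than the paper's actual model.

```python
# Hypothetical sketch of a generate-then-score interface (not the authors' code).
# The diffusion policy and critic are stubbed; all names here are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class EndEffectorEncoding:
    """Illustrative end-effector descriptor (assumed, not from the paper)."""
    mode: str                 # e.g. "suction", "pinch", "caging", "placement"
    features: np.ndarray      # fixed-size descriptor of the hardware

def propose_actions(rgbd: np.ndarray, prompt: str, ee: EndEffectorEncoding,
                    num_candidates: int = 64) -> np.ndarray:
    """Stand-in for the diffusion policy: returns candidate 6-DoF actions."""
    rng = np.random.default_rng(0)
    return rng.uniform(-1.0, 1.0, size=(num_candidates, 6))

def score_actions(actions: np.ndarray, rgbd: np.ndarray, prompt: str,
                  ee: EndEffectorEncoding) -> np.ndarray:
    """Stand-in for the critic: returns one quality score per candidate."""
    return -np.linalg.norm(actions[:, :3], axis=1)   # toy heuristic only

# Usage: sample diverse candidates, score them, pick the best for execution.
obs = np.zeros((480, 640, 4), dtype=np.float32)      # placeholder RGB-D observation
ee = EndEffectorEncoding(mode="suction", features=np.zeros(16))
prompt = "pick the red box"
candidates = propose_actions(obs, prompt, ee)
best = candidates[int(np.argmax(score_actions(candidates, obs, prompt, ee)))]
print("selected action:", best)
```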
Submission Number: 46