Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models
TL;DR: We propose Neural Implicit Action Fields (NIAF), which reformulates VLA action representation from discrete waypoints to continuous differentiable functions, enabling sampling at arbitrary temporal resolution and stable impedance control.
Abstract: Despite the rapid progress of vision-language-action (VLA) models, the prevailing practice of predicting action chunks as discrete waypoints remains structurally misaligned with the intrinsic continuity of physical motion. This discretization arises naturally from fixed-rate robot data collection and the token-by-token prediction paradigm of large language models, but ties actions to rigid sampling rates, does not naturally support analytically consistent higher-order derivatives, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), which reformulates chunk-level action representation from discrete waypoints to continuous action functions. Using a vision-language model as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes continuous-time action manifolds with arbitrary temporal resolution. This formulation enables analytical differentiation, allowing explicit supervision of velocity and regularization of higher-order derivative signals to promote mathematical consistency, physical plausibility, and control smoothness. Our approach achieves strong results on CALVIN and LIBERO across diverse backbones. Real-world experiments further confirm that NIAF supports stable impedance control, bridging policy-side action generation and execution-side smooth control.
Lay Summary: Vision-language-action (VLA) models typically predict robot actions as short chunks of discrete waypoints, such as future joint positions for a robot arm. Although this format is convenient for imitation learning, it is structurally mismatched with the continuous nature of robot motion. It ties the learned policy to a fixed action sampling rate and makes velocity and higher-order motion signals difficult to obtain reliably, since they usually have to be recovered by interpolation or finite differences.
This paper proposes Neural Implicit Action Fields (NIAF), which represents each action chunk as a continuous-time function rather than a fixed sequence of waypoints. Given visual observations, proprioception, and a language instruction, the VLA model predicts modulation parameters for a sinusoidal representation network (SIREN), an MLP with sinusoidal activation functions. Because SIRENs define continuously differentiable functions, we can query actions at arbitrary temporal resolutions and analytically compute velocity. This enables direct velocity supervision and jerk regularization during training and provides smoother position and velocity references for impedance control during execution.
By treating robot actions as continuous functions, NIAF makes the action representation better aligned with how robot arms physically move.
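To make the idea of a continuous action function concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a single hidden sinusoidal layer, a 7-dimensional action, and a hypothetical amplitude/phase modulation (`gamma`, `phi`) standing in for whatever the VLA model actually predicts; the layer sizes, initialization, and the names `make_siren`, `action`, and `velocity` are illustrative assumptions. It shows that the same function can be queried on any time grid and that its velocity follows from the chain rule in closed form, with no finite differences.

```python
import numpy as np

def make_siren(hidden=32, action_dim=7, w0=30.0, seed=0):
    """Random one-hidden-layer SIREN mapping t in [0, 1] to an action vector.
    Sizes and initialization are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    bound = np.sqrt(6.0 / hidden) / w0
    return {
        "w0": w0,
        "W1": rng.uniform(-1.0, 1.0, size=hidden),       # time -> hidden
        "b1": rng.uniform(-1.0, 1.0, size=hidden),
        "W2": rng.uniform(-bound, bound, size=(action_dim, hidden)),
        "b2": np.zeros(action_dim),
    }

def _pre_activation(params, t, phi):
    # Shape (T, hidden): w0 * (W1 * t + b1) + phi
    return params["w0"] * (t[:, None] * params["W1"] + params["b1"]) + phi

def action(params, t, gamma, phi):
    """a(t) = W2 @ (gamma * sin(w0 * (W1 t + b1) + phi)) + b2, for each t."""
    h = gamma * np.sin(_pre_activation(params, t, phi))
    return h @ params["W2"].T + params["b2"]

def velocity(params, t, gamma, phi):
    """Analytic da/dt via the chain rule: sin -> cos, times w0 * W1."""
    dh = gamma * np.cos(_pre_activation(params, t, phi)) * params["w0"] * params["W1"]
    return dh @ params["W2"].T

params = make_siren()
# Hypothetical modulation; in NIAF these would be predicted by the VLA model.
gamma = np.ones(32)
phi = np.zeros(32)

# The same chunk queried at two different temporal resolutions.
coarse = action(params, np.linspace(0.0, 1.0, 10), gamma, phi)
fine = action(params, np.linspace(0.0, 1.0, 1000), gamma, phi)
print(coarse.shape, fine.shape)
```

Because `velocity` is the exact derivative of `action`, a position reference and a matching velocity reference can be sampled at whatever rate a downstream controller runs, and a velocity loss can be computed against demonstrated velocities without differencing noisy waypoints.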
Primary Area: Applications->Robotics
Keywords: Vision-Language-Action Models, Implicit Action Representation, Robotic Manipulation
Submission Number: 2695