Abstract: Vision-Language-Action (VLA) models have
emerged as powerful generalist policies for robotic control,
yet their performance scaling across model architectures and
hardware platforms, as well as their associated power budgets,
remains poorly understood. This work presents an evaluation
of five representative VLA models—spanning state-of-the-art
baselines and two newly proposed architectures—targeting edge
and datacenter GPU platforms. Using the LIBERO benchmark,
we measure accuracy alongside system-level metrics, including
latency, throughput, and peak memory usage, under varying
edge power constraints and high-performance datacenter GPU
configurations. Our results identify distinct scaling trends: (1)
architectural choices, such as action tokenization and model
backbone size, strongly influence throughput and memory
footprint; (2) power-constrained edge devices exhibit non-linear
performance degradation, with some configurations matching
or exceeding older datacenter GPUs; and (3) high-throughput
variants can be obtained without significant accuracy loss.
These findings offer actionable guidance for selecting and
optimizing VLAs across a range of deployment constraints. Our
work challenges current assumptions about the superiority of
datacenter hardware for robotic inference.
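For context, the system-level metrics named above (latency, throughput, peak memory) can be collected with standard GPU timing primitives. The sketch below is a minimal illustration in PyTorch, not the paper's actual measurement harness; the `benchmark` helper and its warm-up and iteration counts are assumptions for illustration only.

```python
import torch

def benchmark(model, example_inputs, n_warmup=10, n_iters=100):
    """Hypothetical helper: mean per-call latency (ms), throughput
    (calls/s), and peak GPU memory (MiB) for one forward pass."""
    model.eval()
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(n_warmup):   # warm-up: kernel autotuning, caches
            model(*example_inputs)
        torch.cuda.synchronize()
        start.record()
        for _ in range(n_iters):
            model(*example_inputs)
        end.record()
        torch.cuda.synchronize()    # wait for all queued kernels
    latency_ms = start.elapsed_time(end) / n_iters
    throughput = 1000.0 / latency_ms          # inferences per second
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    return latency_ms, throughput, peak_mib
```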