SPARK: Simple Post-training for Adapting pRetrained Knowledge to Robot Control

Published: 24 May 2026, Last Modified: 24 May 2026 | Venue: ScaleBot @ CVPR 2026 | Readers: Everyone | License: CC BY 4.0
Keywords: representation learning
TL;DR: We propose SPARK, a post-training method that adapts foundation models and general-purpose vision models to robot control.
Abstract: Large-scale representation learning has produced public foundation models that provide strong general-purpose visual features. However, their pretraining objectives are not designed for robot representation learning, so the resulting embeddings tend to emphasize global semantics rather than the temporally sensitive information required for robot control. Training robot models from scratch is also undesirable because robotics data is limited in scale and costly to collect, and because pretraining carries a substantial computational burden. In this paper, we propose SPARK (Simple Post-training for Adapting pRetrained Knowledge), a post-training method that adapts foundation models to robot control. SPARK is built on two principles: dynamics-aware abstraction, which encourages the encoder to derive compact features that preserve information essential for understanding temporal changes and action-relevant scene structure, and knowledge preservation, which aligns patch-level representations with those of the original foundation model to retain useful pretrained semantics. Together, these objectives yield compact visual state representations that remain semantically meaningful while preserving useful prior knowledge. Experiments across multiple robotics benchmarks show that SPARK consistently improves success rates and generalization over vanilla foundation models, and further demonstrate that these gains transfer to real-world robot manipulation.
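To make the knowledge-preservation principle concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the actual loss, so this assumes that alignment is measured as the mean cosine distance between corresponding patch embeddings of the adapted encoder and the frozen original model. The names `cosine_similarity` and `patch_alignment_loss`, and the toy embeddings, are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: a zero vector is treated as having similarity 0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def patch_alignment_loss(adapted_patches, frozen_patches):
    """Mean (1 - cosine similarity) over corresponding patches.

    adapted_patches: patch embeddings from the encoder being post-trained.
    frozen_patches:  patch embeddings from the original (frozen) model.
    Both are lists of equal-length vectors, one vector per image patch.
    This cosine form is an illustrative assumption, not the paper's loss.
    """
    if len(adapted_patches) != len(frozen_patches):
        raise ValueError("patch counts must match")
    if not adapted_patches:
        raise ValueError("at least one patch is required")
    distances = [
        1.0 - cosine_similarity(a, f)
        for a, f in zip(adapted_patches, frozen_patches)
    ]
    return sum(distances) / len(distances)


# Toy example: 2 patches with 3-dimensional embeddings.
frozen = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
aligned = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # same directions as frozen
drifted = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]  # first patch is orthogonal

loss_aligned = patch_alignment_loss(aligned, frozen)
loss_drifted = patch_alignment_loss(drifted, frozen)
print(loss_aligned, loss_drifted)
```

In this toy setup the loss is zero when every adapted patch points in the same direction as its frozen counterpart and grows as patches drift away, which is the behavior a knowledge-preservation term would plausibly reward. How such a term is weighted against the dynamics-aware objective is not specified in the abstract.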
Submission Number: 23