Keywords: Video world models, Robotics and embodied AI, Efficient and real-time generation, Action-conditioned video prediction, Sparse attention, Trajectory generation
TL;DR: Built on Cosmos Predict 2.5, CoDA uses DCT-guided adaptive sparse attention to cut visual tokens by up to 50.7% while preserving or improving robotic trajectory accuracy.
Abstract: Large-scale pretrained world models, most notably NVIDIA Cosmos Predict 2.5, provide powerful trajectory generation capabilities for robotic manipulation. Yet this power comes at a steep computational cost: applying full attention across all timesteps becomes a direct bottleneck in real-world robot deployment. We observe that a robot does not need to "think hard" equally at every moment. Recent work on action tokenization (e.g., the FAST framework) has already demonstrated that the high-frequency DCT components of an action sequence effectively encode action complexity. We repurpose this insight into dynamic attention control: at moments when DCT high-frequency energy is strong, i.e., when the robot must execute precise, complex motions, we retain more visual tokens; in less demanding intervals, we aggressively reduce them. This modulation is applied per-chunk, leaving open the possibility of further acceleration in real-time deployment scenarios. Building on this, we propose CoDA (Cosmos-driven DCT-Adaptive Sparse Attention), which applies a DCT-based dynamic token budget to the cross-attention layers of Cosmos Predict 2.5. We introduce two gating variants: CoDA-S, a supervised gate that mimics DCT-derived complexity scores, and CoDA-A, an autonomous gate that self-optimizes the accuracy-efficiency trade-off via a penalty-regularized objective. Both variants are fine-tuned from a full-attention model (the baseline) via warm-start, preserving the backbone's manipulation capability. On the MimicGen Square task, compared to this baseline, CoDA-S reduces visual tokens by 50.7% without degrading trajectory accuracy, and CoDA-A reduces visual tokens by 25.8% while improving GT MSE by ∼30% (0.51 → 0.36). Eliminating unnecessary tokens also suppresses attention noise, thereby improving trajectory accuracy itself. Our results demonstrate that adaptive sparsity can simultaneously reduce computation and improve trajectory accuracy in 7-DoF robotic manipulation.
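To make the core idea concrete, here is a minimal sketch of how a DCT-derived complexity score could drive a per-chunk visual-token budget. The abstract does not specify the actual gating formula, so everything below is an illustrative assumption: the function names (`dct2`, `token_budget`), the cutoff fraction, and the linear mapping from high-frequency energy to a keep ratio are all hypothetical, not the authors' implementation.

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II along the last axis (NumPy-only; equivalent to
    scipy.fft.dct(x, type=2, norm='ortho'))."""
    n = x.shape[-1]
    k = np.arange(n)
    basis = np.cos(np.pi / n * (k[:, None] + 0.5) * k[None, :])
    X = x @ basis
    X *= np.sqrt(2.0 / n)
    X[..., 0] /= np.sqrt(2.0)
    return X

def token_budget(action_chunk, cutoff_frac=0.25, min_keep=0.5, max_keep=1.0):
    """Map the high-frequency DCT energy of an action chunk to a visual-token
    keep ratio in [min_keep, max_keep].

    action_chunk: (T, D) array of T timesteps x D action dims (e.g. 7-DoF).
    Frequencies above `cutoff_frac` of the spectrum count as "high"; a smooth
    chunk scores near 0 (keep few tokens), a jerky one scores higher.
    All thresholds here are illustrative placeholders.
    """
    coeffs = dct2(action_chunk.T)               # DCT over time, per action dim
    cutoff = int(np.ceil(cutoff_frac * coeffs.shape[-1]))
    total = np.sum(coeffs ** 2) + 1e-8          # Parseval: total signal energy
    high = np.sum(coeffs[..., cutoff:] ** 2)    # energy in high-frequency bins
    score = high / total                        # complexity score in [0, 1]
    return min_keep + (max_keep - min_keep) * score
```

Under this sketch, a smooth linear trajectory yields a keep ratio near `min_keep`, while the same trajectory with added jitter yields a noticeably larger budget, matching the paper's intuition that complex motion intervals warrant more visual tokens.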
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 9