Keywords: world model, foundation model, flow matching, autoregressive flow matching
TL;DR: Generative forecasting of autoencoded, compact Vision Foundation Model features outperforms regression and naïve diffusion, enabling scalable, uncertainty-aware, and label-free long-horizon scene prediction.
Abstract: Forecasting by generating RGB videos is computationally expensive, often physically implausible, and not directly actionable, since it requires translation into
decision-making signals. Direct modality forecasting (e.g., predicting future segmentation) produces directly actionable outputs but fails to scale due to the need
for labels. Vision Foundation Model (VFM) features offer the best of both worlds:
they contain actionable semantic and geometric information that can be easily
decoded from the predicted features, while requiring no labels on the downstream
task for training. However, almost all existing VFM feature forecasting methods
regress future features from a fixed number of input frames, with evaluation predominantly on short horizons matching the training setup. We first show that existing
regression methods struggle with forecasting from partial observations because
they average over multiple plausible futures, failing to capture uncertainty in the
future given the past. Interestingly, naively replacing deterministic forecasting
with generative flow matching does not match the sample quality of the regression
model, despite being a mathematically appropriate formulation of the forecasting
task. In this work, we explain why this is the case, and we show how to optimally
generate foundation model features. Our key insight is that generative modeling
of VFM features requires (auto)encoding into a compact latent space suitable for
diffusion. We show that this latent space preserves information more effectively
than previously used alternatives, such as uncompressed feature diffusion or PCA-based compression, both for forecasting and other applications, such as image
generation. Our results suggest that conditional generation of (compressed) VFM
features offers a promising and scalable foundation for future scene forecasters.
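
To illustrate the claim that regression averages over multiple plausible futures, here is a minimal toy sketch. It is not the paper's method or data: the scalar "feature", the two equally likely futures (-1 and +1), and the names `mse` and `demo` are all assumptions made for illustration. It shows that the MSE-optimal deterministic prediction lands near the mean of the futures, far from either plausible outcome, while drawing samples (a stand-in for a generative model) stays on the plausible outcomes.

```python
import random


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


def demo(num_samples=10_000, seed=0):
    rng = random.Random(seed)
    # Hypothetical toy futures: a single scalar feature that is either -1 or +1
    # with equal probability (two plausible futures given the same past).
    futures = [rng.choice((-1.0, 1.0)) for _ in range(num_samples)]
    # The constant that minimizes MSE is the sample mean, so an ideal
    # deterministic regressor would predict this value.
    regression_prediction = sum(futures) / len(futures)
    # Stand-in for a generative forecaster: draw from the empirical
    # distribution of futures instead of collapsing it to one value.
    generated = [rng.choice(futures) for _ in range(5)]
    return futures, regression_prediction, generated


futures, regression_prediction, generated = demo()
print("regression prediction:", regression_prediction)
print("generated samples:", generated)
print("MSE of regression prediction:", mse(regression_prediction, futures))
print("MSE of committing to +1:", mse(1.0, futures))
```

In this toy setting the regression output has the lowest MSE yet matches neither future, which is the averaging behavior the abstract describes; the sampled values each coincide with one of the two futures.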
Submission Number: 36