Abstract: Recent advances in video generation have revealed an unexpected capability: diffusion-based video models can perform reasoning through Chain-of-Frame (CoF), which suggests that reasoning unfolds sequentially across video frames. In this work, we revisit this assumption and uncover a different mechanism. Through systematic analysis, we show that video reasoning primarily develops along the diffusion denoising steps instead, a finding we validate through qualitative analysis and probing tests. We term this mechanism Chain-of-Steps (CoS). Our investigation reveals several intriguing emergent behaviors that are critical for the success of video reasoning: 1) Models exhibit a form of long-horizon memory that supports tasks requiring persistent reference, such as object permanence. 2) They can self-correct or refine intermediate mistakes during generation instead of committing to incorrect trajectories. 3) Analysis of Diffusion Transformer layers reveals task-agnostic functional specialization: early layers focus on dense perceptual grounding, specific middle layers carry out the key reasoning procedures, and later layers consolidate the latent representation at each denoising step. Motivated by these insights, we propose a simple training-free strategy that ensembles reasoning paths by merging latents from identical models run with different random seeds at inference time. This approach encourages the exploration of diverse reasoning trajectories and improves reasoning performance. Together, our findings provide the first systematic dissection of the mechanisms underlying video reasoning and offer practical insights for developing more capable video reasoning models.
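The seed-ensembling strategy in the abstract can be illustrated with a minimal, purely schematic sketch. The `toy_denoise` function below is a hypothetical stand-in for one denoising step of the real video diffusion model (the paper's actual sampler and merge schedule are not specified here): each seed drives an independent trajectory, and the latents are averaged after every step so the trajectories share a merged state.

```python
import numpy as np

def toy_denoise(latent, step, rng):
    # Hypothetical stand-in for a single DiT denoising update;
    # the real model would predict and remove noise here.
    noise_scale = 1.0 / (step + 1)
    return 0.9 * latent + noise_scale * 0.01 * rng.standard_normal(latent.shape)

def seed_ensemble_denoise(init_shape, seeds, num_steps):
    """Run one denoising trajectory per seed, merging latents at each step.

    Averaging the per-seed latents is one simple way to 'merge latents from
    identical models with different random seeds'; other merge rules
    (e.g. merging only at selected steps) are equally plausible.
    """
    rngs = [np.random.default_rng(s) for s in seeds]
    # Each seed yields a different initial noise latent, i.e. a distinct
    # starting point for a reasoning trajectory.
    latents = [rng.standard_normal(init_shape) for rng in rngs]
    merged = np.mean(latents, axis=0)
    for step in range(num_steps):
        # Advance every trajectory independently for this denoising step.
        latents = [toy_denoise(z, step, rng) for z, rng in zip(latents, rngs)]
        # Merge the reasoning paths by averaging their latents.
        merged = np.mean(latents, axis=0)
        # Broadcast the merged latent back so trajectories stay coupled.
        latents = [merged.copy() for _ in latents]
    return merged

# Toy usage: a small (channels, height, width) latent, three seeds, ten steps.
out = seed_ensemble_denoise((4, 8, 8), seeds=[0, 1, 2], num_steps=10)
print(out.shape)
```

The key design choice in this sketch is merging after every step, which keeps the trajectories coupled while still letting each seed inject its own stochastic exploration.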