SpatioTemporal-GRPO: Post-Training Large Multimodal Models for Video QA

ICLR 2026 Conference Submission 18206 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Video Question Answering, Large Multimodal Models, Post-Training
Abstract: We introduce SpatioTemporal-GRPO (ST-GRPO), a novel extension of the GRPO algorithm for video question answering. ST-GRPO addresses a limitation of standard GRPO: when all responses in a group have similar correctness, the reward variance within the group is low and the resulting learning signal is uninformative. Our method overcomes this by generating multiple spatiotemporal variants of a video to serve as complementary inputs. Unlike standard GRPO, which groups only textual responses, ST-GRPO forms groups across both textual and spatiotemporal variants. This increases reward variance within each group and provides a more informative signal for learning. To ensure that these visual variations are meaningful, we propose an importance-based grouping strategy that computes per-frame relevance scores from cross-modal embeddings and prioritizes frames that carry higher semantic weight relative to the question. This question-aware strategy ensures that the spatiotemporal groups are built from the visual cues relevant to each query. Our experiments demonstrate consistent improvements across six challenging video understanding benchmarks: VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest. These results show that incorporating structured visual diversity into reinforcement learning is a more effective way to learn from spatiotemporal cues in video question answering.
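The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The advantage function is the usual group-normalized form from standard GRPO. The reward values, the toy embeddings, the use of cosine similarity as the relevance score, and the helper names `group_advantages`, `frame_relevance`, and `top_k_frames` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def frame_relevance(frame_emb, question_emb):
    """Assumed relevance score: cosine similarity of each frame to the question."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    return f @ q


def top_k_frames(scores, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    return np.sort(np.argsort(scores)[-k:])


# Text-only group: every sampled response is correct, so the variance is zero
# and every advantage vanishes (no learning signal).
adv_text = group_advantages([1.0, 1.0, 1.0, 1.0])

# Hypothetical mixed group: responses on the original video plus responses on
# spatiotemporal variants, some of which are now wrong.
adv_mixed = group_advantages([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])

# Toy 2-D embeddings standing in for cross-modal features (assumption).
frames = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
question = np.array([1.0, 0.0])
scores = frame_relevance(frames, question)
kept = top_k_frames(scores, k=2)
```

In this toy setup the text-only group yields all-zero advantages, while the mixed group yields nonzero advantages of both signs, which is the variance effect the abstract describes. The frame-selection step keeps the frames most similar to the question, one plausible reading of "importance-based grouping".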
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18206