SpatioTemporal-GRPO: Post-Training Large Multimodal Models for Video QA

Emad Bahrami; Olga Zatsarynna; Parth Pathak; Sunando Sengupta; Juergen Gall; Mohsen Fayyaz

19 Sept 2025 (modified: 12 Feb 2026); ICLR 2026 Conference Desk Rejected Submission; CC BY 4.0

Keywords: Video Question Answering, Large Multimodal Models, Post-Training

Abstract: We introduce SpatioTemporal-GRPO (ST-GRPO), an extension of the GRPO algorithm for video question answering. ST-GRPO addresses a limitation of standard GRPO: when all responses in a group have similar correctness, the reward variance within the group is low and the model receives an uninformative learning signal. Our method overcomes this by generating multiple spatiotemporal variants of a video to serve as complementary inputs. Unlike standard GRPO, which forms groups only over textual responses, ST-GRPO forms groups across both textual responses and spatiotemporal variants. This increases reward variance within each group and provides a more informative signal for learning. To ensure these visual variations are meaningful, we propose an importance-based grouping strategy that computes per-frame relevance scores from cross-modal embeddings and prioritizes frames that carry higher semantic weight relative to the question. This question-aware approach ensures that the spatiotemporal groups are informed by the visual cues relevant to each query. Our experiments demonstrate consistent improvements across six challenging video understanding benchmarks: VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest. These results show that incorporating structured visual diversity into reinforcement learning is a more effective way to learn from spatiotemporal cues in video question answering.
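The two ideas in the abstract can be illustrated with a minimal sketch. It assumes the commonly used group-relative advantage (each reward minus the group mean, divided by the group standard deviation) and assumes that per-frame relevance is a cosine similarity between frame and question embeddings. The abstract specifies neither formula, and the function names `group_advantages`, `frame_relevance`, and `top_k_frames` are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Assumed normalization; when all rewards are equal, every advantage is 0.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def frame_relevance(frame_embeddings, question_embedding):
    """Assumed relevance score: cosine similarity of each frame to the question."""
    return [cosine(f, question_embedding) for f in frame_embeddings]


def top_k_frames(scores, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# A group where every response is correct carries no learning signal.
uniform = group_advantages([1, 1, 1, 1])

# Mixing in responses from other video variants (hypothetical rewards)
# restores variance, so the advantages become non-zero.
mixed = group_advantages([1, 0, 1, 0])

# Toy 2-D embeddings: frames 0 and 2 align with the question direction.
scores = frame_relevance([[1, 0], [0, 1], [1, 1]], [1, 0])
selected = top_k_frames(scores, k=2)
```

In this toy setting `uniform` is all zeros, which is the uninformative case the abstract describes, while `mixed` has positive advantages for correct responses and negative ones for incorrect responses. The frame selection step only shows how question-aware scores could rank frames; how the paper turns such scores into spatiotemporal variants is not stated in the abstract.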

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 18206
