Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments

Published: 26 Jan 2026 · Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Microgravity, Action Recognition, Vision-Language Understanding
TL;DR: MicroG-4M is the first benchmark for human action and scene understanding in microgravity, offering clips, captions, and QA pairs to reveal domain gaps and guide robust vision-language models.
Abstract: Despite substantial progress in video understanding, most existing datasets are limited to Earth’s gravitational conditions. Yet microgravity alters human motion, interactions, and visual semantics, exposing a critical gap for real-world vision systems and posing a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes $4{,}759$ clips with $13{,}261$ action annotations covering $50$ actions, $1{,}238$ context-rich captions, and over $7{,}000$ question–answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, thereby enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/lei-qi-233/MicroG-4M.
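To make the benchmark structure concrete, below is a minimal Python sketch of how one sample (clip, multi-label action set, caption, and QA pairs) might be represented, together with a simple per-sample F1 score of the kind commonly used for multi-label action recognition. All field names (e.g., clip_id, actions, qa_pairs) and label strings are illustrative assumptions, not the actual MicroG-4M file format; consult the GitHub repository for the real schema and evaluation code.

```python
from dataclasses import dataclass, field

# Hypothetical schema: field names are assumptions for illustration only,
# not the actual MicroG-4M annotation format.

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class MicroGSample:
    clip_id: str                                        # identifier of the video clip
    actions: list[str] = field(default_factory=list)    # multi-label action set (50 classes total)
    caption: str = ""                                   # context-rich caption for the clip
    qa_pairs: list[QAPair] = field(default_factory=list)

# Toy sample mirroring the three benchmark tasks (labels are invented):
sample = MicroGSample(
    clip_id="iss_clip_0001",
    actions=["floating", "handrail_translation"],
    caption="An astronaut translates along a handrail inside the station module.",
    qa_pairs=[QAPair("Where is the astronaut?", "Inside a station module.")],
)

# Multi-label action recognition is typically scored per label; here is a
# minimal per-sample F1 between predicted and ground-truth label sets:
def label_f1(pred: set[str], gold: set[str]) -> float:
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(label_f1({"floating"}, set(sample.actions)))  # prints 0.666... (1 of 2 labels recovered)
```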
Primary Area: datasets and benchmarks
Submission Number: 16618