Keywords: Egocentric vision, Vision-language models, Temporal reasoning
TL;DR: We introduce EGO-FLIGHT, a benchmark and dataset of egocentric video clips for evaluating vision-language models on temporal frame-sorting tasks, and show that two fine-tuning strategies improve model performance.
Abstract: Multimodal Large Language Models have shown impressive progress across vision-language tasks, but they still struggle with temporal reasoning, a critical skill for understanding dynamic visual content. We introduce EGO-FLIGHT, a benchmark and dataset designed to directly evaluate temporal reasoning in Vision-Language Models (VLMs) through frame-ordering tasks in first-person visual contexts. Our dataset contains 1,056 continuous egocentric video clips that capture natural variations in lighting, motion, and occlusion, mirroring how humans experience and interpret dynamic scenes. Experiments on frame-sorting tasks that vary four controlled variables reveal that current models perform substantially below the human baseline, although longer videos, fewer frames, and more annotation generally improve performance. Finally, fine-tuning a VLM on our hand-collected data with two LoRA strategies improves performance over the base model, suggesting a promising path toward stronger temporal reasoning. We hope this work advances research on temporal understanding and encourages the development of models that align more closely with human perception while supporting realistic learning in embodied systems such as robots.
Submission Number: 41
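
The abstract does not state how a predicted frame order is scored, so the following is only an illustrative sketch of two common options for a frame-sorting task: exact match and pairwise order accuracy. The function names `exact_match` and `pairwise_order_accuracy` and the use of integer frame identifiers are assumptions made for this example, not part of EGO-FLIGHT.

```python
from itertools import combinations
from typing import Sequence


def exact_match(predicted: Sequence[int], truth: Sequence[int]) -> bool:
    """Return True only if the predicted order equals the true order."""
    return list(predicted) == list(truth)


def pairwise_order_accuracy(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of frame pairs whose relative order is predicted correctly.

    Both sequences must contain the same unique frame identifiers
    (hypothetical integer ids here).
    """
    if sorted(predicted) != sorted(truth) or len(set(truth)) != len(truth):
        raise ValueError("predicted and truth must be permutations of the same unique frames")
    if len(truth) < 2:
        return 1.0  # a single frame is trivially ordered

    # Position of each frame in the predicted sequence.
    position = {frame: i for i, frame in enumerate(predicted)}

    # combinations() over truth yields pairs (a, b) with a before b in the true order.
    pairs = list(combinations(truth, 2))
    correct = sum(position[a] < position[b] for a, b in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    truth = [0, 1, 2, 3]
    predicted = [0, 2, 1, 3]  # two adjacent frames swapped
    print(exact_match(predicted, truth))
    print(pairwise_order_accuracy(predicted, truth))
```

Pairwise accuracy gives partial credit when a model gets most of the sequence right, whereas exact match counts a single swap as a complete failure; which one is appropriate depends on how strictly temporal order should be judged.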