Keywords: ViT, Reasoning, XAI, LLMs, HAR, Representation Learning
Abstract: Reasoning methods have recently been explored to improve model transparency and trust, particularly in video understanding, where actions are defined by temporal order, object interactions, and state transitions. However, most approaches remain post-hoc, offering limited opportunity to influence a model’s internal reasoning process or improve its accuracy. In this work, we move beyond post-hoc explanation and introduce a Reasoning Supervision training pipeline that directly enhances model performance. This setting presents unique challenges: how to generate training-time reasoning guidance, what form this guidance should take, and how to inject it effectively into the model. Our framework addresses these challenges by leveraging large language models (LLMs) as proxy annotators to generate high-quality spatial supervision. We introduce two complementary loss functions to inject this guidance into the model: a spatial alignment loss that aligns attention with LLM-derived spatial reasoning guidance, and a temporal reasoning loss that encourages coherent, human-like temporal dependencies across frames. Applied to Vision Transformer architectures, Reasoning Supervision consistently improves performance, establishing a simple yet effective paradigm for advancing ViT-based video understanding models.
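The abstract does not specify the exact form of the two losses; one plausible instantiation, sketched below purely for illustration, is a KL-divergence term aligning patch attention with an LLM-derived relevance mask, plus a smoothness penalty on attention across consecutive frames. All function names and shapes here are assumptions, not the paper's actual formulation.

```python
import numpy as np

def spatial_alignment_loss(attn, guidance, eps=1e-8):
    """Hypothetical spatial alignment loss: KL divergence from an
    LLM-derived spatial guidance mask (target) to the model's
    patch-attention distribution. Both inputs are 1-D arrays of
    non-negative patch scores; they are normalized internally."""
    p = guidance / (guidance.sum() + eps)  # target distribution
    q = attn / (attn.sum() + eps)          # model attention distribution
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def temporal_reasoning_loss(attn_per_frame):
    """Hypothetical temporal reasoning loss: mean squared difference
    between attention maps of consecutive frames, penalizing abrupt
    attention shifts. Input shape: (num_frames, num_patches)."""
    diffs = np.diff(attn_per_frame, axis=0)
    return float(np.mean(diffs ** 2))

# Toy usage: 4 frames, 16 patches each, random scores for illustration.
rng = np.random.default_rng(0)
attn = rng.random((4, 16))
guide = rng.random(16)
total = spatial_alignment_loss(attn[0], guide) + temporal_reasoning_loss(attn)
```

In practice such terms would be weighted and added to the standard classification objective during ViT training; the sketch only shows the shape of the computation.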
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13228