Keywords: ViT, Reasoning, XAI, LLMs, HAR, Representation Learning
Abstract: Reasoning methods have recently been explored to improve model transparency and trust, particularly in video understanding, where actions are defined by temporal order, object interactions, and state transitions. However, most approaches remain post-hoc, offering limited opportunity to influence a model’s internal reasoning process or improve its accuracy. In this work, we move beyond post-hoc explanation and introduce a Reasoning Supervision training pipeline that directly enhances model performance. This setting presents unique challenges: how to generate training-time reasoning guidance, what form this guidance should take, and how to inject it effectively into the model. Our framework addresses these challenges by leveraging large language models (LLMs) as proxy annotators to generate high-quality spatial supervision. We introduce two complementary loss functions to inject this guidance into the model: a spatial alignment loss that aligns attention with LLM-derived spatial reasoning guidance, and a temporal reasoning loss that encourages coherent, human-like temporal dependencies across frames. Applied to Vision Transformer architectures, Reasoning Supervision consistently improves performance, establishing a simple yet effective paradigm for advancing ViT-based video understanding models.
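The abstract does not specify the exact form of the two losses; one plausible instantiation, sketched below purely for illustration, is a KL-divergence term aligning patch attention with an LLM-derived relevance mask, plus a smoothness penalty on attention across consecutive frames. All function names and shapes here are assumptions, not the paper's actual formulation.

```python
import numpy as np

def spatial_alignment_loss(attn, guidance, eps=1e-8):
    """Hypothetical spatial alignment loss: KL divergence from an
    LLM-derived spatial guidance mask (target) to the model's
    patch-attention distribution. Both inputs are 1-D arrays of
    non-negative patch scores; they are normalized internally."""
    p = guidance / (guidance.sum() + eps)  # target distribution
    q = attn / (attn.sum() + eps)          # model attention distribution
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def temporal_reasoning_loss(attn_per_frame):
    """Hypothetical temporal reasoning loss: mean squared difference
    between attention maps of consecutive frames, penalizing abrupt
    attention shifts. Input shape: (num_frames, num_patches)."""
    diffs = np.diff(attn_per_frame, axis=0)
    return float(np.mean(diffs ** 2))

# Toy usage: 4 frames, 16 patches each, random scores for illustration.
rng = np.random.default_rng(0)
attn = rng.random((4, 16))
guide = rng.random(16)
total = spatial_alignment_loss(attn[0], guide) + temporal_reasoning_loss(attn)
```

In practice such terms would be weighted and added to the standard classification objective during ViT training; the sketch only shows the shape of the computation.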
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 13228