Disentangling Intent: Sparse Autoencoders for Interpretable Action Transition Prediction in Egocentric Video

28 Apr 2026 (modified: 28 Apr 2026) · THU 2026 Spring ANM Submission · CC BY 4.0
Keywords: egocentric video, sparse autoencoders, mechanistic interpretability, V-JEPA, action transition prediction, intent forecasting
TL;DR: We apply sparse autoencoders to V-JEPA egocentric video embeddings to decompose latent representations into human-interpretable concepts, revealing what a video foundation model encodes about human intent and action transitions.
Abstract: As wearable computing devices—smart glasses, AR headsets—become ubiquitous, there is growing demand for proactive AI systems that can anticipate a user's next action before it occurs. The foundation for such systems lies in egocentric video: the first-person visual stream captured by body-mounted cameras, rich in hand–object interactions, gaze patterns, and environmental context. Recent self-supervised models, most notably V-JEPA, produce dense 768-dimensional latent representations that encode rich spatiotemporal features without requiring human labels. When paired with supervised classifiers such as MLPs trained on Ego4D annotations, these representations enable promising action transition prediction. However, such pipelines remain fundamentally opaque: the classifier maps the dense embedding to a label without revealing which aspects of the representation drive the prediction. To open this black box, we apply sparse autoencoders to V-JEPA embeddings, decomposing the dense representation into sparse, human-interpretable concepts that reveal what the model encodes about human intent and upcoming action transitions.
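The decomposition step admits a compact illustration. The following is a minimal sketch, not the paper's implementation: it trains a sparse autoencoder on 768-dimensional vectors standing in for V-JEPA clip embeddings, and the expansion factor, L1 coefficient, and learning rate are illustrative assumptions rather than reported settings.

```python
# Minimal sparse-autoencoder sketch for 768-d V-JEPA-style embeddings.
# Hyperparameters (expansion factor, L1 coefficient, learning rate) are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, expansion: int = 8):
        super().__init__()
        d_hidden = d_model * expansion            # overcomplete dictionary of candidate concepts
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))           # sparse feature activations
        x_hat = self.decoder(z)                   # reconstruction of the input embedding
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse, interpretable features.
    recon = (x - x_hat).pow(2).mean()
    sparsity = z.abs().mean()
    return recon + l1_coeff * sparsity

if __name__ == "__main__":
    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    embeddings = torch.randn(256, 768)            # stand-in for a batch of V-JEPA embeddings
    opt.zero_grad()
    x_hat, z = sae(embeddings)
    loss = sae_loss(embeddings, x_hat, z)
    loss.backward()
    opt.step()
    active = (z > 0).float().sum(dim=1).mean().item()
    print(f"loss={loss.item():.4f}, active features per clip={active:.1f}")
```

The L1 penalty drives most hidden units to zero for any given clip, so each embedding is explained by a small set of active dictionary directions that can then be inspected, named, and related to action transitions.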
Submission Number: 10