Most of This Video Is Boring

Published: 24 Apr 2026, Last Modified: 01 Jun 2026, VisCon 2026 Poster, CC BY 4.0
Keywords: GUI video encoding, transition detection, event-driven compression, computer-use agents, visual concepts
TL;DR: Exploit the episodic structure of GUI video by encoding only transition events, achieving 200× compression and higher QA accuracy than uniform sampling at matched token budgets.
Abstract: Screen recordings exhibit episodic structure: sparse transition events separated by stationary intervals carrying near-zero new information. We present Asuncion, an event-driven encoder that represents each transition as a structured visual concept tuple ⟨before, after, locus, type⟩, discarding stationary content entirely. This yields 200× compression over naive tokenization and outperforms both uniform subsampling and LongVU on GUI-World QA at matched 52K-token budgets.
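
To illustrate the event-driven idea described in the abstract, here is a minimal sketch that skips stationary frames and emits one ⟨before, after, locus, type⟩ tuple per detected change. It is not the Asuncion implementation. Everything in it is an assumption made for illustration: plain frame differencing as the transition detector, frame indices standing in for the before/after content, a bounding box as the locus, and a two-label "local"/"global" type taxonomy, along with the names `TransitionEvent`, `changed_box`, `encode_events`, `pixel_tol`, and `global_frac`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Frame = List[List[int]]          # grayscale frame: rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


@dataclass
class TransitionEvent:
    before: int  # index of the last frame before the change (assumed encoding)
    after: int   # index of the first frame after the change (assumed encoding)
    locus: Box   # bounding box of the changed pixels (assumed encoding)
    type: str    # "local" or "global"; a hypothetical taxonomy


def changed_box(a: Frame, b: Frame, pixel_tol: int = 0) -> Optional[Box]:
    """Return the bounding box of pixels that differ by more than pixel_tol."""
    height, width = len(a), len(a[0])
    rows = [r for r in range(height)
            if any(abs(a[r][c] - b[r][c]) > pixel_tol for c in range(width))]
    if not rows:
        return None  # frames are identical within tolerance
    cols = [c for c in range(width)
            if any(abs(a[r][c] - b[r][c]) > pixel_tol for r in range(height))]
    return (rows[0], cols[0], rows[-1], cols[-1])


def encode_events(frames: List[Frame], pixel_tol: int = 0,
                  global_frac: float = 0.5) -> List[TransitionEvent]:
    """Emit one event per changed consecutive frame pair; skip stationary pairs."""
    events: List[TransitionEvent] = []
    for i in range(1, len(frames)):
        box = changed_box(frames[i - 1], frames[i], pixel_tol)
        if box is None:
            continue  # stationary interval: nothing is encoded
        top, left, bottom, right = box
        area = (bottom - top + 1) * (right - left + 1)
        total = len(frames[i]) * len(frames[i][0])
        # Label the change by how much of the screen its bounding box covers.
        kind = "global" if area >= global_frac * total else "local"
        events.append(TransitionEvent(i - 1, i, box, kind))
    return events


if __name__ == "__main__":
    blank = [[0] * 4 for _ in range(4)]
    clicked = [row[:] for row in blank]
    clicked[1][2] = 255  # a single-pixel "cursor" change
    for event in encode_events([blank, blank, clicked, clicked]):
        print(event)
```

In this toy setting the cost of the encoding scales with the number of transitions rather than the number of frames, which is the property the abstract relies on. The actual detector, tuple contents, and type labels used by the authors may differ.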
Submission Number: 39