End-to-End Action Segmentation Transformer

Tieqiao Wang, Sinisa Todorovic

Published: 23 Oct 2025, Last Modified: 26 Jan 2026ICCV 2025EveryoneCC BY 4.0

Abstract: Most recent work on action segmentation relies on precomputed frame features from models trained on other tasks and typically focuses on framewise encoding and labeling without explicitly modeling action segments. To overcome these limitations, we introduce the End-to-End Action Segmentation Transformer (EAST), which processes raw video frames directly – eliminating the need for pre-extracted features and enabling true end-to-end training. Our contributions are as follows: (1) a lightweight adapter design for effective fine-tuning of large backbones; (2) an efficient segmentation-by-detection framework for leveraging action proposals predicted over a coarsely downsampled video; and (3) a novel action-proposal-based data augmentation strategy. EAST achieves SOTA performance on standard benchmarks, including GTEA, 50Salads, Breakfast, and Assembly-101.