Abstract: Most recent work on action segmentation relies on precomputed frame features from models trained on other tasks
and typically focuses on framewise encoding and labeling
without explicitly modeling action segments. To overcome
these limitations, we introduce the End-to-End Action Segmentation Transformer (EAST), which processes raw video
frames directly – eliminating the need for pre-extracted features and enabling true end-to-end training. Our contributions are as follows: (1) a lightweight adapter design
for effective fine-tuning of large backbones; (2) an efficient segmentation-by-detection framework for leveraging
action proposals predicted over a coarsely downsampled
video; and (3) a novel action-proposal-based data augmentation strategy. EAST achieves state-of-the-art performance on
standard benchmarks, including GTEA, 50Salads, Breakfast, and Assembly-101.
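
As a rough illustration of the segmentation-by-detection idea in contribution (2), the sketch below shows one plausible way to turn action proposals predicted on a coarsely downsampled video into framewise labels at the original frame rate. It is not the EAST implementation: the `Proposal` type, the `proposals_to_frame_labels` function, the `stride` and `background` parameters, and the rule that higher-scoring proposals overwrite lower-scoring ones are all assumptions made for this example.

```python
from typing import List, NamedTuple


class Proposal(NamedTuple):
    """Hypothetical action proposal on the downsampled timeline."""
    start: int    # first downsampled frame index (inclusive)
    end: int      # last downsampled frame index (exclusive)
    label: int    # action class id
    score: float  # detector confidence


def proposals_to_frame_labels(
    proposals: List[Proposal],
    num_frames: int,
    stride: int,
    background: int = 0,
) -> List[int]:
    """Map proposals from the downsampled timeline to per-frame labels.

    Assumption: one downsampled frame covers `stride` original frames,
    and overlaps are resolved in favour of the higher-scoring proposal.
    """
    labels = [background] * num_frames
    # Paint low-scoring proposals first so that higher-scoring ones overwrite them.
    for p in sorted(proposals, key=lambda q: q.score):
        lo = max(0, p.start * stride)
        hi = min(num_frames, p.end * stride)
        for t in range(lo, hi):
            labels[t] = p.label
    return labels


if __name__ == "__main__":
    demo = [Proposal(0, 2, 1, 0.9), Proposal(1, 3, 2, 0.5)]
    print(proposals_to_frame_labels(demo, num_frames=10, stride=4))
```

In this toy setting the proposals are cheap to predict because they live on the short downsampled timeline, while the final labeling is still produced for every original frame.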