Stacked encoder–decoder transformer with boundary smoothing for action segmentation

Published: 28 Nov 2022, Last Modified: 23 Apr 2026 · OpenReview Archive Direct Upload · CC BY-NC-ND 4.0
Abstract: In this work, a new stacked encoder–decoder transformer (SEDT) model is proposed for action segmentation. SEDT is composed of a series of encoder–decoder modules, each of which consists of an encoder with self-attention layers and a decoder with cross-attention layers. By adding an encoder with self-attention before every decoder, the model preserves local information along with global information. The proposed encoder–decoder pairing also prevents the accumulation of errors that occurs when features are propagated through successive decoders. Moreover, the approach performs boundary smoothing in order to handle ambiguous action boundaries. Experimental results on two popular benchmark datasets, “GTEA” and “50 Salads”, show that the proposed model outperforms existing temporal convolutional network based models and the attention-based model ASFormer.
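The abstract does not give implementation details for boundary smoothing. A minimal sketch of one common formulation is shown below, assuming frame-level one-hot targets that are softened toward the neighboring action class within a fixed window around each segment boundary; the function name, window size, and interpolation weights are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def smooth_boundaries(frame_labels, num_classes, window=4):
    """Soften one-hot frame targets near action boundaries.

    Frames within `window` frames of a boundary get part of their
    target mass moved to the class on the other side of the boundary,
    so the loss does not over-penalize ambiguous boundary frames.
    This is an illustrative sketch, not the paper's exact scheme.
    """
    T = len(frame_labels)
    targets = np.eye(num_classes)[frame_labels]  # (T, num_classes) one-hot
    # A boundary is the first frame of each new action segment.
    boundaries = [t for t in range(1, T) if frame_labels[t] != frame_labels[t - 1]]
    for b in boundaries:
        prev_c, next_c = frame_labels[b - 1], frame_labels[b]
        for off in range(-window, window):
            t = b + off
            if 0 <= t < T:
                # Weight on the far-side class: largest right at the
                # boundary, decaying linearly to ~0 at the window edge.
                alpha = 0.5 * (1 - abs(off + 0.5) / window)
                targets[t] = 0.0
                targets[t, frame_labels[t]] = 1 - alpha
                targets[t, next_c if off < 0 else prev_c] += alpha
    return targets
```

Each row still sums to 1, so the result can be used directly as a soft target for a frame-wise cross-entropy loss.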