Abstract: Highlights
• We propose TeST, containing three transformer-based architecture variants, to conduct temporal action localization.
• The three transformer-based architectures can effectively improve localization performance and space–time efficiency.
• We propose to integrate the results from multiple feature maps to obtain more comprehensive predictions.
• Extensive experiments on two real-world benchmarks validate the effectiveness and superiority of our proposed TeST.