Abstract: Temporal action segmentation (TAS) is a critical step toward long-term video understanding. Recent studies rely on pre-extracted video features instead of raw video frames. However, these models require complicated training pipelines and have limited application scenarios. They struggle to segment human actions in real time because they can only run after features for the full video have been extracted. Since real-time action segmentation differs from the TAS task, we define it as the streaming video real-time temporal action segmentation (SVTAS) task. In this paper, we propose a real-time end-to-end multi-modality model for the SVTAS task. More specifically, we incrementally propose three frameworks that solve the SVTAS task and enhance model performance step by step. To the best of our knowledge, this is the first multi-modality real-time temporal action segmentation model. Under the same evaluation criteria as full-video TAS, our model segments human actions in real time with less than 40% of the computation of the state-of-the-art model and achieves 90% of its accuracy. Code is available at https://github.com/Thinksky5124/SVTAS.git.
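The streaming setting described above can be illustrated with a minimal sketch (hypothetical names and toy data, not the paper's model): full-video TAS must buffer every frame before predicting, while SVTAS emits per-frame labels chunk by chunk as frames arrive.

```python
# Minimal sketch (not the paper's implementation) contrasting full-video TAS
# with the streaming setting (SVTAS). All names here are hypothetical.
from typing import Callable, Iterable, Iterator, List

def full_video_tas(frames: List[int], classify: Callable[[int], int]) -> List[int]:
    # Full-video TAS: labels are only available after ALL frames
    # (or their pre-extracted features) have been collected.
    return [classify(f) for f in frames]

def streaming_svtas(frame_stream: Iterable[int],
                    classify: Callable[[int], int],
                    chunk_size: int = 4) -> Iterator[List[int]]:
    # SVTAS: frames arrive as a stream; the model emits per-frame labels
    # chunk by chunk, so segmentation happens while the video plays.
    chunk: List[int] = []
    for frame in frame_stream:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield [classify(f) for f in chunk]
            chunk = []
    if chunk:  # flush the final partial chunk
        yield [classify(f) for f in chunk]

# Toy "classifier": even frame index -> action 0, odd -> action 1.
classify = lambda f: f % 2
frames = list(range(10))

offline = full_video_tas(frames, classify)
online = [lbl for chunk in streaming_svtas(iter(frames), classify) for lbl in chunk]
assert offline == online  # same labels, but online ones are produced incrementally
```

The sketch only shows the interface difference; in the paper's setting the classifier is a multi-modality network and the chunks are short clips of raw frames rather than integers.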