Spatial-Temporal Transformer Network for Continuous Action Recognition in Industrial Assembly

Jianfeng Huang, Xiang Liu, Huan Hu, Shanghua Tang, Chenyang Li, Shaoan Zhao, Yimin Lin, Kai Wang, Zhaoxiang Liu, Shiguo Lian

Published: 2024, Last Modified: 09 Jan 2026, Venue: ICIC (10) 2024, License: CC BY-SA 4.0

Abstract: Automatically detecting whether a worker’s manual operations comply with the standard procedure in industrial assembly remains an open problem. In this paper, we first present a spatio-temporal Transformer network (STTN) that recognizes each action, i.e., each manual operation on an assembly line, by combining self-attention and cross-attention to extract the interaction between human and object. We then propose an action sequence recognition scheme that allows the standard operation steps to be defined flexibly for various assembly cases and that decides, step by step, whether the worker’s continuous operations comply with the standard. Additionally, to improve generalization, we present a continual learning scheme that refines the STTN model with the worker in the loop. Comparative experimental results show that STTN achieves state-of-the-art performance on both the public VidOR dataset and our own practical assembly dataset. Moreover, a practical deployment on a dishwashing machine assembly line shows that our method can help improve both product quality and production efficiency. Our Industrial Assembly Dataset (IndAD), with video data collected from several practical assembly lines, is now the largest open dataset for action recognition in assembly scenarios.
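
The abstract does not say how the step-by-step compliance decision is made, so the following is only a minimal sketch of one way such a check could work, not the authors' method. It assumes the recognizer emits one action label per video clip, that repeated labels for the same ongoing action can be merged, and that compliance means matching a user-defined ordered list of steps. The names `StepChecker`, `collapse_repeats`, and the action labels are hypothetical.

```python
from dataclasses import dataclass, field


def collapse_repeats(labels):
    """Merge consecutive identical per-clip labels into single actions."""
    actions = []
    for label in labels:
        if not actions or actions[-1] != label:
            actions.append(label)
    return actions


@dataclass
class StepChecker:
    """Tracks progress through a user-defined ordered list of standard steps.

    Illustrative assumption: a strict in-order match is required. The paper
    may use a different, more flexible definition of compliance.
    """
    standard_steps: list
    position: int = 0
    violations: list = field(default_factory=list)

    def observe(self, action):
        """Consume one recognized action; return True if it is compliant."""
        if self.position >= len(self.standard_steps):
            # An action arrived after every standard step was completed.
            self.violations.append((self.position, None, action))
            return False
        expected = self.standard_steps[self.position]
        if action == expected:
            self.position += 1
            return True
        # Record the step index, the expected step, and what was seen.
        self.violations.append((self.position, expected, action))
        return False

    @property
    def complete(self):
        """True once every standard step has been observed in order."""
        return self.position == len(self.standard_steps)


# Hypothetical standard procedure and recognizer output.
steps = ["pick_panel", "insert_screw", "tighten_screw"]
stream = ["pick_panel", "pick_panel", "tighten_screw",
          "insert_screw", "tighten_screw"]

checker = StepChecker(steps)
results = [checker.observe(a) for a in collapse_repeats(stream)]
print(results, checker.complete, checker.violations)
```

Because the standard steps are plain data passed to the checker, a different assembly case only needs a different list, which is one simple reading of defining the steps flexibly.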