Abstract: This paper addresses sign video interpretation, a weakly supervised task in which each sign action in a video lacks exact temporal boundaries and labels. We design a Parallel Temporal Encoder (PTEnc) that learns the temporal relations of a sign video from local and global sequential learning views in parallel, exploiting the complementarity between local and global temporal cues. The fused encoded feature sequence is then fed into a Connectionist Temporal Classification (CTC) based sentence decoder. In addition, to enhance the temporal cues in each video, we introduce a reconstruction loss, which operates in an unsupervised manner without additional labels. The CTC loss and the reconstruction loss are optimized jointly in an end-to-end training manner. Experimental results on a benchmark dataset demonstrate the effectiveness of the proposed method.
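The sketch below illustrates one plausible reading of this pipeline: a local branch (temporal convolution) and a global branch (BiLSTM) encode the frame features in parallel, their outputs are fused, and the model is trained with a CTC loss plus an unsupervised feature-reconstruction loss. It is a minimal PyTorch sketch under assumed design choices; the branch types, fusion scheme, dimensions, and reconstruction target are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch of the described pipeline (assumed PyTorch setup).
# Branch designs, sizes, and the fusion/reconstruction choices below
# are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class PTEnc(nn.Module):
    """Parallel Temporal Encoder: a local branch and a global branch
    run in parallel on the same features, then get fused."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # Local branch: 1D temporal convolution for short-range cues.
        self.local = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Global branch: BiLSTM for long-range sequential cues.
        self.global_branch = nn.LSTM(feat_dim, hidden // 2,
                                     batch_first=True, bidirectional=True)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, x):  # x: (B, T, feat_dim)
        loc = self.local(x.transpose(1, 2)).transpose(1, 2)   # (B, T, hidden)
        glo, _ = self.global_branch(x)                        # (B, T, hidden)
        return self.fuse(torch.cat([loc, glo], dim=-1))       # (B, T, hidden)

class Model(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = PTEnc(feat_dim, hidden)
        self.classifier = nn.Linear(hidden, vocab + 1)   # +1 for CTC blank
        self.reconstruct = nn.Linear(hidden, feat_dim)   # unsupervised head

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.reconstruct(h)

# Joint end-to-end objective: CTC loss on the decoded gloss sequence plus
# a reconstruction loss on the input features (no additional labels).
model = Model()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
x = torch.randn(2, 100, 512)               # 2 videos, 100 frames each
targets = torch.randint(1, 1001, (2, 12))  # gloss label sequences
logits, recon = model(x)
log_probs = logits.log_softmax(-1).transpose(0, 1)  # (T, B, C) for CTC
loss = (ctc(log_probs, targets,
            torch.full((2,), 100), torch.full((2,), 12))
        + nn.functional.mse_loss(recon, x))
loss.backward()
```

Summing the two losses lets the reconstruction term regularize the temporal encoding without extra annotation, consistent with the weakly supervised setting the abstract describes.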