Abstract: In this work, we present an ensemble-based sign video recognition method. The proposed method models sign videos in a multi-modal manner, using different input representations such as RGB video and body key-points (pose data). We represent an input sign video in two ways: as a dense frame input and as a sparse frame input. The dense input is modeled with a 3D Convolutional Neural Network (CNN) on a 64-frame RGB input window and a Long Short-Term Memory (LSTM) network on a 32-frame pose input. The sparse input consists of 5 representative frames picked from a sign video, which are modeled with a CNN and a Graph Convolutional Network (GCN). The representative frames of a video are selected using pose confidences obtained from an off-the-shelf pose estimation model. Our experimental results show that, while the dense 3D CNN model achieves the best performance as a single classifier, the GCN-based sparse model provides extra recognition capacity. More specifically, the sparse models, when combined with the dense models in an ensemble, can disambiguate similar-looking sign classes. Our proposed multi-source ensemble method outperforms several state-of-the-art methods on the AUTSL Turkish sign language benchmark dataset.
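The two mechanisms named in the abstract, confidence-based frame selection and ensemble fusion, can be illustrated with a minimal Python sketch. The abstract does not specify the selection rule or the fusion rule, so everything below is an assumption made for illustration: frames are ranked by their mean key-point confidence, and models are fused by a weighted average of their softmax scores. The function names `select_representative_frames` and `ensemble_predict` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence


def select_representative_frames(
    pose_conf: Sequence[Sequence[float]], k: int = 5
) -> List[int]:
    """Return the indices of k frames, in temporal order.

    ASSUMPTION: frames are ranked by mean key-point confidence and the
    top k are kept. The abstract only says pose confidences are used,
    not how; this rule is a hypothetical stand-in.
    """
    # One score per frame: the average confidence over its key-points.
    scores = [sum(frame) / len(frame) if frame else 0.0 for frame in pose_conf]
    # Rank frame indices from most to least confident (stable for ties).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k and restore temporal order.
    return sorted(ranked[:k])


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(
    model_logits: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> int:
    """Return the class index chosen by a late-fusion ensemble.

    ASSUMPTION: fusion is a weighted average of per-model softmax
    scores; the abstract does not state the fusion rule.
    """
    if weights is None:
        weights = [1.0] * len(model_logits)
    probs = [softmax(logits) for logits in model_logits]
    n_classes = len(probs[0])
    total_weight = sum(weights)
    fused = [
        sum(w * p[c] for w, p in zip(weights, probs)) / total_weight
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: fused[c])
```

With made-up logits, a dense model that narrowly prefers class 0 over class 1 (`[2.0, 1.9, 0.0]`) is overruled once a sparse model that is confident in class 1 (`[0.0, 2.0, 0.0]`) is added to the ensemble, which is the kind of disambiguation between similar-looking classes the abstract describes.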