Abstract: In this work, we target cross-domain action recognition (CDAR) in the video domain and propose a novel end-to-end pairwise two-stream ConvNets (PTC) algorithm for real-life conditions in which only a few labeled samples are available. To cope with the limited-training-sample problem, we employ a pairwise network architecture that can leverage training samples from a source domain and thus requires only a few labeled samples per category from the target domain. In particular, a frame self-attention mechanism and an adaptive weight scheme are embedded into the PTC network to adaptively combine the RGB and flow features. This design can effectively learn domain-invariant features for both the source and target domains. In addition, we propose a sphere boundary sample-selecting scheme that selects the training samples at the boundary of a class (in the feature space) to train the PTC model. In this way, the model's generalization capability is substantially enhanced. To validate the effectiveness of our PTC model, we construct two CDAR data sets (SDAI Action I and SDAI Action II) that include indoor and outdoor environments; all actions and samples in these data sets were carefully collected from public action data sets. To the best of our knowledge, these are the first data sets specifically designed for the CDAR task. Extensive experiments were conducted on these two data sets. The results show that PTC outperforms state-of-the-art video action recognition methods in terms of both accuracy and training efficiency. Notably, when only two labeled training samples per category are used on the SDAI Action I data set, PTC achieves 21.9% and 6.8% improvements in accuracy over the two-stream and temporal segment network models, respectively. As an added contribution, the SDAI Action I and SDAI Action II data sets will be released to facilitate future research on the CDAR task.
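The abstract names two mechanisms concretely enough to sketch: (i) a frame self-attention mechanism plus an adaptive weight scheme that fuses the RGB and optical-flow streams, and (ii) a sphere boundary scheme that selects, per class, the training samples lying near the class boundary in feature space. The PyTorch sketch below illustrates a plausible reading of (i); it is not the authors' implementation, and all module names, the single-layer attention scorer, the sigmoid gate, and the feature dimensions are illustrative assumptions.

```python
# Minimal sketch (NOT the authors' code) of frame self-attention and
# adaptive RGB/flow fusion as described in the abstract. Names and
# dimensions are hypothetical.
import torch
import torch.nn as nn

class FrameSelfAttention(nn.Module):
    """Scores each frame's feature vector and returns an
    attention-weighted average over the temporal axis."""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # assumed single-layer scorer

    def forward(self, x):                    # x: (batch, frames, feat_dim)
        attn = torch.softmax(self.score(x), dim=1)  # weight per frame
        return (attn * x).sum(dim=1)                # (batch, feat_dim)

class AdaptiveTwoStreamFusion(nn.Module):
    """Learns a per-sample weight that mixes the attended RGB and
    optical-flow descriptors into one video-level feature."""
    def __init__(self, feat_dim):
        super().__init__()
        self.rgb_attn = FrameSelfAttention(feat_dim)
        self.flow_attn = FrameSelfAttention(feat_dim)
        self.gate = nn.Linear(2 * feat_dim, 1)  # adaptive stream weight

    def forward(self, rgb_feats, flow_feats):
        rgb = self.rgb_attn(rgb_feats)           # (batch, feat_dim)
        flow = self.flow_attn(flow_feats)
        w = torch.sigmoid(self.gate(torch.cat([rgb, flow], dim=1)))
        return w * rgb + (1.0 - w) * flow        # fused descriptor

if __name__ == "__main__":
    fusion = AdaptiveTwoStreamFusion(feat_dim=512)
    rgb = torch.randn(4, 8, 512)   # 4 clips, 8 frames, 512-d backbone feats
    flow = torch.randn(4, 8, 512)
    print(fusion(rgb, flow).shape)  # torch.Size([4, 512])
```

In the same hedged spirit, one natural reading of the sphere boundary sample-selecting scheme (ii) is to rank each class's samples by distance from the class centroid and keep the farthest ones, i.e. those near the class boundary of a sphere around the centroid. The helper below is hypothetical and assumes precomputed per-sample features.

```python
# Hypothetical reading of "sphere boundary" sample selection: per class,
# keep the k samples farthest from the class centroid in feature space.
def select_boundary_samples(feats, labels, k):
    """feats: (n, d) tensor; labels: (n,) long tensor.
    Returns indices of up to k boundary samples per class."""
    picked = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        center = feats[idx].mean(dim=0, keepdim=True)    # class centroid
        dist = (feats[idx] - center).norm(dim=1)         # radial distance
        k_c = min(k, idx.numel())
        picked.append(idx[dist.topk(k_c).indices])       # farthest = boundary
    return torch.cat(picked)
```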