Abstract: The performance of deep neural networks is strongly influenced by the quantity and quality of annotated data. Most large activity recognition datasets consist of data sourced from the web, which does not reflect the challenges that arise in activities of daily living. In this paper, we introduce a large real-world video dataset for activities of daily living: Toyota Smarthome. The dataset consists of 16K RGB+D clips of 31 activity classes, performed by seniors in a smart home. Unlike in previous datasets, the videos are fully unscripted. As a result, the dataset poses several challenges: high intra-class variation, high class imbalance, both simple and composite activities, and activities with similar motion and variable duration. Activities are annotated with both coarse and fine-grained labels. These characteristics differentiate Toyota Smarthome from other activity recognition datasets. Because recent activity recognition approaches fail to address the challenges posed by Toyota Smarthome, we present a novel activity recognition method with an attention mechanism: a pose-driven spatiotemporal attention mechanism built on 3D ConvNets. We show that our method outperforms state-of-the-art methods on benchmark datasets as well as on the Toyota Smarthome dataset. We release the dataset for research use.