Abstract: We present a deep-network-based hierarchical framework to recognize activities performed collectively by people in videos at multiple levels of granularity: individual person, group, and overall (scene) level. Individual person analysis, which includes person detection, tracking, pose estimation, and individual activity recognition, has been studied extensively, while most existing work on collective activity recognition has focused on estimating the overall scene activity. However, in scenarios where multiple groups perform different group activities, scene-level recognition in isolation paints an incomplete picture. Identifying groups and recognizing their activities is therefore essential to understanding a scene in its entirety. To this end, we add an extra layer to existing methods that identifies the groups (or clusters) of people present in a scene and their activities. We then combine these group activities with the scene context to recognize the scene activity. To discover the groups, we propose a min-max criterion within the framework to train a sub-network that learns the pairwise similarity between any two individuals; a clustering algorithm then uses these similarities to identify the groups. Group activity is captured by an LSTM module, whereas individual and scene activities are captured by CNN-LSTM modules. These modules, together with the grouping layer, form the proposed network. We evaluate the network on a publicly available dataset to demonstrate the usefulness of our approach.
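As a rough illustration of the grouping step described above, the sketch below shows how a learned pairwise-similarity score could drive a simple clustering of individuals into groups. The feature dimension, layer widths, threshold-based connected-component clustering, and all names (e.g. PairwiseSimilarity, cluster_by_similarity) are illustrative assumptions, not the paper's actual grouping layer or its min-max training criterion.

```python
import torch
import torch.nn as nn


class PairwiseSimilarity(nn.Module):
    """Toy sub-network scoring how likely two individuals belong to the same group.

    Hypothetical feature size and layer widths; the paper's min-max training
    criterion for this sub-network is not reproduced here.
    """

    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # similarity score in [0, 1]
        )

    def forward(self, feat_a, feat_b):
        return self.mlp(torch.cat([feat_a, feat_b], dim=-1)).squeeze(-1)


def cluster_by_similarity(features, sim_net, threshold=0.5):
    """Greedy grouping: link any two people whose pairwise similarity exceeds
    the threshold, then return connected components as groups (illustrative only)."""
    n = features.size(0)
    parent = list(range(n))

    def find(i):
        # Union-find with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    with torch.no_grad():
        for i in range(n):
            for j in range(i + 1, n):
                if sim_net(features[i], features[j]).item() > threshold:
                    parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


if __name__ == "__main__":
    people = torch.randn(6, 256)   # per-person appearance/pose features (assumed input)
    net = PairwiseSimilarity()
    print(cluster_by_similarity(people, net))  # e.g. [[0, 3], [1], [2, 4, 5]]
```

In the full framework, the resulting groups would each feed an LSTM module for group activity, with individual and scene activities handled by CNN-LSTM modules; the sketch only covers the grouping stage.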