Abstract: This paper addresses the task of group activity
recognition in multi-person videos. Existing approaches decompose this task into feature learning and relational reasoning.
Despite showing progress, these methods rely only on appearance features of people and overlook the available contextual
information, which can play an important role in group activity
understanding. In this work, we focus on the feature learning
aspect and propose a two-stream architecture that not only
considers person-level appearance features, but also makes use
of contextual information present in videos for group activity
recognition. In particular, we propose to use two types of
contextual information, pose context and scene context, which are
beneficial in two different scenarios and provide crucial cues for group
activity understanding. We combine appearance and contextual
features to encode each person with an enriched representation.
Finally, these combined features are used in relational reasoning
for predicting group activities. We evaluate our method on
two benchmarks, Volleyball and Collective Activity, and show
that jointly modeling contextual information with appearance
features benefits group activity understanding.
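
The pipeline described above (fuse appearance and contextual features per person, reason over relations between people, then predict a group activity) can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, the relational module, or the pooling step, so the concatenation, dot-product attention, and max pooling used here are assumptions chosen for illustration, and all function names, dimensions, and random inputs are hypothetical.

```python
import math
import random


def concat_features(appearance, context):
    """Fuse per-person features by concatenation (an assumed fusion choice)."""
    return [a + c for a, c in zip(appearance, context)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relational_reasoning(people):
    """Assumed relational module: scaled dot-product attention over people.

    Each person's refined feature is a weighted average of all persons'
    features, weighted by similarity to that person.
    """
    dim = len(people[0])
    refined = []
    for query in people:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in people
        ]
        weights = softmax(scores)
        refined.append(
            [sum(w * p[d] for w, p in zip(weights, people)) for d in range(dim)]
        )
    return refined


def group_scores(people, class_weights):
    """Max-pool person features into one group feature, then score each class
    with a linear layer (both are assumptions, not the paper's stated design)."""
    dim = len(people[0])
    pooled = [max(p[d] for p in people) for d in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in class_weights]


# Hypothetical sizes and random stand-ins for learned features and weights.
rng = random.Random(0)
num_people, app_dim, ctx_dim, num_classes = 4, 6, 3, 5
appearance = [[rng.uniform(-1, 1) for _ in range(app_dim)] for _ in range(num_people)]
context = [[rng.uniform(-1, 1) for _ in range(ctx_dim)] for _ in range(num_people)]
class_weights = [
    [rng.uniform(-1, 1) for _ in range(app_dim + ctx_dim)]
    for _ in range(num_classes)
]

fused = concat_features(appearance, context)      # enriched per-person features
refined = relational_reasoning(fused)             # relation-aware features
scores = group_scores(refined, class_weights)     # one score per activity class
predicted = max(range(num_classes), key=lambda i: scores[i])
print(len(fused[0]), len(scores), predicted)
```

In a trained model the appearance and contextual features would come from learned networks over video frames, pose estimates, and scene content rather than random values; the sketch only shows how the stages connect.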