Abstract: This paper investigates a new task, Weakly-supervised Group Activity Recognition in Still-images (WGARS), which aims to extend the applicability of Group Activity Recognition (GAR) to broader scenarios, such as low-latency domains. To tackle this challenge, we propose a Spatial Dual Context Transformer (SDCT), comprising a Dual Context Encoder (DCE) and a Dual Context Decoder (DCD). The DCE module separately encodes two contexts from still images: a holistic context that captures the integral relations among all actors, and a partial context built from individual actor features. The DCD module then exploits the complementarity between the holistic and partial contexts, alternately updating the two encoded contexts to enhance the interaction among actors. Additionally, auxiliary supervised contrastive learning is incorporated to mitigate activity confusion. The proposed SDCT attains state-of-the-art performance on the Volleyball and NBA datasets in WGARS. Notably, when extended to weakly-supervised GAR in videos, SDCT even outperforms recent methods on the Volleyball dataset.