Abstract: Recognizing daily human activities is essential for domestic robots to assist humans effectively in indoor environments. These activities typically involve sequences of interactions between humans and objects across different locations and times within a household. Identifying these events and understanding their temporal and spatial relationships are crucial for accurately modeling human behavior patterns. However, most current methods and datasets for human activity recognition focus on identifying single events at specific moments and locations, neglecting the complexity of activities that span multiple times and places. To address this gap, we collected data on human activity patterns over a single day through crowdsourcing. Based on these data, we introduce a novel synthetic video question-answering dataset. The dataset includes videos of daily activities accompanied by question-answer pairs that require models to reason about sequences of activities in both time and space. We evaluated state-of-the-art methods on our dataset, highlighting their limitations in handling the intricate spatio-temporal dynamics of human activity sequences. To improve upon these methods, we propose a two-stage model. The model first decodes the detailed content of individual videos using a transformer-based approach, then employs large language models (LLMs) for spatio-temporal reasoning across multiple videos. We hope our research provides valuable benchmarks and insights, paving the way for advancements in the recognition of daily human activity patterns.