Hierarchical and Multimodal Data for Daily Activity Understanding

Published: 11 Mar 2026, Last Modified: 11 Mar 2026. Accepted by DMLR. License: CC BY-SA 4.0
Abstract: Daily Activity Recordings for artificial intelligence (DARai, pronounced /Dahr-ree/) is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors, including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity of human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The unscripted nature of DARai enables the collection of action counterfactuals, defined as observed alternative executions of the same activity under different conditions (e.g., lifting a heavy versus a light object). Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To showcase the shortcomings of individual sensors, we conduct domain-variant experiments that are possible because of DARai's multi-sensor design and its inclusion of action counterfactuals. The code, documentation, and dataset are available at the dedicated DARai website.
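The three-level annotation hierarchy described in the abstract (L1 activities, L2 actions, L3 procedures) can be pictured as nested, time-stamped labels. The following is a minimal illustrative sketch; all class names, label strings, and timestamps below are hypothetical and are not the dataset's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of a three-level activity annotation,
# mirroring DARai's L1/L2/L3 structure. Labels are illustrative only.

@dataclass
class Procedure:            # L3: fine-grained execution step
    name: str
    start_s: float          # segment start, in seconds
    end_s: float            # segment end, in seconds

@dataclass
class Action:               # L2: pattern that may be shared between activities
    name: str
    procedures: List[Procedure] = field(default_factory=list)

@dataclass
class Activity:             # L1: independent high-level task
    name: str
    actions: List[Action] = field(default_factory=list)

# An action counterfactual would be a second observed Activity instance
# with the same L1 label but different L2/L3 execution (e.g., heavy vs.
# light object), not a causal construct.
activity = Activity(
    name="preparing a drink",          # L1 (illustrative)
    actions=[
        Action(
            name="lifting an object",  # L2, reusable across activities
            procedures=[
                Procedure("reach for cup", 0.0, 1.2),   # L3
                Procedure("grasp and lift", 1.2, 2.5),  # L3
            ],
        )
    ],
)

# Label counts per level for this recording:
n_l2 = len(activity.actions)
n_l3 = sum(len(a.procedures) for a in activity.actions)
```

This nesting is one natural way to expose hierarchical labels to recognition, localization, and anticipation models, since each level can be flattened independently into per-frame or per-segment targets.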
Keywords: Multimodal Fusion, Temporal Sequence Modeling, Cross-View Domain Adaptation, Multi-sensor Integration, Hierarchical Learning, Hierarchical Activity Recognition, Time-Series Analysis, Real-World Environments, Action Anticipation, Action Segmentation
Changes Since Last Submission: The manuscript has been revised as follows:
- Abstract: Revised the first mention of "action counterfactuals" with a concrete example.
- Introduction (Sec. 1): Added a definition paragraph with two illustrative cases and an explicit note distinguishing our usage from causal counterfactuals.
- Introduction (Page 4, Item 5): Added "Across all tasks, our baseline set includes transformer-based models alongside convolutional and recurrent approaches."
- Figure 1: Revised captions.
- Section 2 (Related Work): Replaced "counterfactual activities" with "action counterfactuals," emphasizing that these are observed alternatives rather than causal constructs.
- Section 3.1 (Data Collection): Revised phrasing to "action counterfactual activity instances."
- Section 7.3 (new ablation, Fig. 12): Added experiments comparing models trained with vs. without action counterfactuals, illustrating the role of L2/L3 variations in anticipation.
- Contributions (Item 3): Updated wording to "action counterfactual scenarios."
- Conclusion/Discussion: Revised for consistent terminology.
- Discussion (Sec. 8): Added remarks on why L2 and L3 capture distinct aspects of activity understanding and their relevance to real-world applications.
- Section 4 (Task Interfaces): Added a short paragraph listing the three interfaces and their supported tasks.
- Section 3.3 (Data Preprocessing): Expanded in the main manuscript with concise descriptions of alignment, the imputation policy, normalization (with equation), and segmentation/export details, along with brief modality-specific notes.
- Section 3.4 (Ethics and Responsible Use): Added to the revised manuscript, describing IRB approval and consent, video/speech anonymization, removal of personally identifiable information, and the research-use policy.
- Comment #3 ("Include more benchmark metrics in addition to total accuracy (could be included F1-score, example confusion matrix etc.)"), Sections 5-7 (Evaluation): Added F1 scores alongside accuracy and included representative confusion matrices in the main manuscript.
- Sections 1-3 (first-use locations): Expanded all abbreviations at first mention in the manuscript body and in figure/table captions.
- Restructured the conclusion into two separate sections, Discussion and Conclusions.
- Added a Broader Impact statement.
Code: https://github.com/olivesgatech/DARai
Assigned Action Editor: ~Sergio_Escalera1
Submission Number: 106