Abstract: Human action recognition is a widely explored problem in computer vision, with applications in domains such as robotics and surveillance. Most existing methods predict a single action class, yet human actions are not isolated events: they range from simple movements to complex behaviors. Recent work has approached action recognition from a hierarchical perspective, relying on manually annotated labels that are available in some domains. The Activities of Daily Living (ADL) domain, in contrast, lacks datasets structured for this purpose. In this work, we introduce HierADL, a framework for automatically generating action hierarchies in the ADL domain. HierADL leverages semantic features of action labels and visual features from video clips to group fine-grained actions into coarse-grained categories. These hierarchies are integrated with the HierADL classifier, which predicts fine-grained and coarse-grained actions simultaneously to improve accuracy and is compatible with most video classification architectures. In addition, we conduct an ablation study to identify the most effective method for generating action hierarchies from visual and semantic cues. We evaluate HierADL hierarchies qualitatively using t-SNE plots and quantitatively on two ADL datasets: ETRI-Activity and Hierarchical TSU. Our results demonstrate that HierADL outperforms fine-grained-only approaches on several state-of-the-art video backbones, achieving an accuracy improvement of over 3.5%.
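The abstract does not specify the grouping algorithm or the training objective, so the following is only a minimal sketch of the general idea under stated assumptions: per-action semantic and visual feature vectors are fused by concatenation, fine-grained actions are grouped with plain average-linkage agglomerative clustering, and a two-head classifier is trained with a weighted sum of fine and coarse cross-entropy terms. The names `build_hierarchy`, `joint_loss`, and `alpha` are hypothetical and are not taken from HierADL.

```python
import numpy as np


def build_hierarchy(semantic, visual, n_coarse):
    """Map each fine-grained action to one of n_coarse coarse groups.

    semantic, visual: arrays of shape (n_fine, d_s) and (n_fine, d_v),
    one row per fine-grained action (hypothetical inputs).
    Returns an int array of length n_fine with the coarse index per action.
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Assumption: fuse both cues by concatenating L2-normalised features.
    feats = np.hstack([l2norm(np.asarray(semantic, dtype=float)),
                       l2norm(np.asarray(visual, dtype=float))])
    clusters = [[i] for i in range(len(feats))]

    # Average-linkage agglomerative clustering: repeatedly merge the two
    # clusters with the smallest mean pairwise distance.
    while len(clusters) > n_coarse:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([np.linalg.norm(feats[i] - feats[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)

    fine_to_coarse = np.empty(len(feats), dtype=int)
    for c, members in enumerate(clusters):
        fine_to_coarse[members] = c
    return fine_to_coarse


def joint_loss(fine_logits, coarse_logits, fine_label, fine_to_coarse, alpha=0.5):
    """Fine cross-entropy plus alpha times coarse cross-entropy (one sample).

    The coarse target is derived from the fine label through the hierarchy.
    alpha is an assumed weighting, not a value from the paper.
    """
    def xent(logits, target):
        z = np.asarray(logits, dtype=float)
        z = z - z.max()  # stabilise the log-sum-exp
        return -(z[target] - np.log(np.exp(z).sum()))

    coarse_label = fine_to_coarse[fine_label]
    return xent(fine_logits, fine_label) + alpha * xent(coarse_logits, coarse_label)
```

In this sketch the hierarchy is just an index map, so any video backbone that exposes a feature vector could feed two linear heads (one per level) and be trained with `joint_loss`.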
External IDs: dblp:conf/ijcnn/BenaventLledoOMRR25