Abstract: In recent years, understanding and modeling the relationship between the noun (subject) and the verb (action) has gained importance for action recognition in egocentric videos. Existing methods model the noun-verb relationship either through additional modalities or by enlarging the CNN architectures. In contrast, we aim to extract features from the visual information alone, i.e., the raw videos, since audio and gaze information are difficult to obtain in real-life scenarios. In this work, we introduce a novel backbone architecture and a training paradigm for activity recognition in first-person videos. We train the proposed architecture end-to-end using a dual-level fusion (intermediate and late fusion) to model subject-action relevance. We perform our experiments on a benchmark dataset and show the method's efficacy with respect to current state-of-the-art methods. We further discuss the possibility of using the proposed network and its training process to obtain pre-trained feature representations for other video recognition tasks.
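As a minimal illustration of the dual-level fusion idea mentioned in the abstract, the sketch below combines a noun (subject) branch and a verb (action) branch with an intermediate feature-level fusion and a late score-level fusion. All module names, feature dimensions, and fusion operators here are illustrative assumptions; the abstract does not specify the actual architecture.

```python
import torch
import torch.nn as nn

class DualLevelFusion(nn.Module):
    """Sketch of intermediate + late fusion over noun/verb branches.

    Assumes pre-extracted clip-level features; dimensions and class
    counts are placeholders, not the paper's actual configuration.
    """

    def __init__(self, in_dim=2048, hid_dim=512, num_nouns=300, num_verbs=100):
        super().__init__()
        # Branch-specific encoders for subject (noun) and action (verb) cues.
        self.noun_enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.verb_enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        # Intermediate fusion: mix the two branches mid-network.
        self.mid_fuse = nn.Linear(2 * hid_dim, hid_dim)
        # Each classifier sees its own branch plus the fused features.
        self.noun_head = nn.Linear(2 * hid_dim, num_nouns)
        self.verb_head = nn.Linear(2 * hid_dim, num_verbs)

    def forward(self, clip_feat):
        n = self.noun_enc(clip_feat)  # subject stream
        v = self.verb_enc(clip_feat)  # action stream
        # Intermediate fusion of the two streams.
        f = torch.relu(self.mid_fuse(torch.cat([n, v], dim=-1)))
        noun_logits = self.noun_head(torch.cat([n, f], dim=-1))
        verb_logits = self.verb_head(torch.cat([v, f], dim=-1))
        # Late fusion: combine branch scores into a joint (verb, noun)
        # action score, here as an outer sum of log-probabilities.
        action_scores = (verb_logits.log_softmax(-1).unsqueeze(-1)
                         + noun_logits.log_softmax(-1).unsqueeze(-2))
        return noun_logits, verb_logits, action_scores

# Example usage on a batch of 8 hypothetical clip features.
model = DualLevelFusion()
feats = torch.randn(8, 2048)
noun_logits, verb_logits, action_scores = model(feats)
```

In this sketch, the intermediate fusion lets each branch's classifier exploit the other branch's evidence, while the late fusion scores every (verb, noun) pair jointly, which is one common way to couple subject and action predictions.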
External IDs: dblp:conf/ncc/PrabhakarRM25