Context Helps: Integrating Context Information with Videos in a Graph-Based HAR Framework

Published: 01 Jan 2024 · Last Modified: 30 Sept 2024 · NeSy (1) 2024 · CC BY-SA 4.0
Abstract: Human Activity Recognition (HAR) from videos is a challenging, data-intensive task. There have been significant strides in recent years, but even state-of-the-art (SoTA) models rely heavily on domain-specific supervised fine-tuning of visual features, and even with this data- and compute-intensive fine-tuning, overall performance can still be limited. We argue that the next generation of HAR models could benefit from explicit neuro-symbolic mechanisms in order to flexibly exploit the rich contextual information available in, and for, videos. To this end, we propose the Human Activity Recognition with Context Prompt (HARCP) task to investigate the value of contextual information for video-based HAR. We also present a neuro-symbolic, graph neural network-based framework that integrates zero-shot object localisation to address the HARCP task. This framework captures a human activity as a sequence of graph-based scene representations relating parts of the human body to key objects, supporting the targeted injection of external contextual knowledge in symbolic form. We evaluate existing HAR baselines alongside our graph-based methods to demonstrate the advantage of being able to accommodate this additional channel of information. Our evaluations show that not only does contextual information from key objects boost accuracy beyond that provided by SoTA HAR models alone, but our model's errors also show greater semantic similarity to the target class. We argue that this represents improved alignment with human-like errors and quantify it with a novel measure we call Semantic Prediction Dispersion.
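The abstract does not define Semantic Prediction Dispersion; one plausible reading is that it summarises how semantically far a model's misclassifications land from the target class. The sketch below is a minimal, hypothetical illustration of such a measure, assuming pre-computed class-label embeddings (`label_embeddings`) and using mean cosine distance over misclassified samples; the function name, inputs, and formulation are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def semantic_prediction_dispersion(y_true, y_pred, label_embeddings):
    """Hypothetical dispersion measure: mean semantic distance between the
    predicted and target class labels, computed over misclassified samples.

    label_embeddings: dict mapping each class name to a vector (e.g. from
    any sentence- or word-embedding model); this is an assumption, not the
    paper's specification.
    """
    distances = []
    for target, pred in zip(y_true, y_pred):
        if target == pred:
            continue  # correct predictions contribute no dispersion
        e_t = label_embeddings[target]
        e_p = label_embeddings[pred]
        cos_sim = float(np.dot(e_t, e_p) /
                        (np.linalg.norm(e_t) * np.linalg.norm(e_p)))
        distances.append(1.0 - cos_sim)  # cosine distance
    return float(np.mean(distances)) if distances else 0.0

# Under this reading, errors onto semantically close classes
# (e.g. "open door" predicted as "close door") yield lower dispersion
# than errors onto unrelated classes (e.g. "open door" as "throw ball"),
# which would correspond to the more human-like errors the abstract describes.
```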