Deep models modelled after the human brain boost performance in action classification

23 Sept 2023 (modified: 11 Feb 2024), submitted to ICLR 2024
Primary Area: applications to neuroscience & cognitive science
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Neuroscience, Cognition, Deep Learning, Action Recognition
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Neural networks trained to recognise actions from video frames fail to learn representations of body pose. Domain-specific neural networks overcome this limitation, achieving better accuracy and more closely resembling human patterns of responses.
Abstract: Recognizing actions from visual input is a fundamental cognitive ability. Perceiving what others are doing is a gateway to inferring their goals, emotions, beliefs and traits. Action recognition is also key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from the body pose and movements, as well as from the background scene. However, the extent to which deep neural networks make use of information about the body and information about the background remains unclear. In particular, since these two sources of information may be correlated within a training dataset, deep networks might learn to rely predominantly on one of them, without taking full advantage of the other. Unlike deep networks, humans have domain-specific brain regions selective for perceiving bodies, and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that deep networks trained on the Human Atomic Actions 500 dataset perform almost as accurately on versions of the stimuli that show both body and background as on versions from which the body was removed, but are at chance level on versions from which the background was removed. Conversely, human participants (N=28) can recognize the same set of actions accurately with all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel deep network architecture patterned after domain specificity in the brain, which uses separate streams to process body and background information. We show that 1) this architecture improves action recognition performance, and 2) its accuracy across different versions of the stimuli follows a pattern that more closely matches the pattern of accuracy observed in human participants.
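The dual-stream architecture described in the abstract can be illustrated with a minimal sketch. This is a hypothetical illustration, not the submission's actual implementation: all layer sizes, parameter names, and the use of simple linear layers (as stand-ins for full video backbones) are assumptions. The one detail taken from the source is the fusion of two domain-specific streams, one for the body and one for the background scene, before action classification over the 500 classes of Human Atomic Actions 500.

```python
import numpy as np

rng = np.random.default_rng(0)

def stream(x, w, b):
    """One domain-specific stream: a linear layer + ReLU, an illustrative
    stand-in for a full video backbone (e.g. a CNN over frames)."""
    return np.maximum(0.0, x @ w + b)

# Illustrative dimensions (assumptions, not from the paper);
# 500 matches the number of classes in Human Atomic Actions 500.
d_in, d_hidden, n_actions = 512, 128, 500

# Separate parameters for the body stream and the background (scene) stream.
w_body, b_body = rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden)
w_scene, b_scene = rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden)
w_out, b_out = rng.normal(size=(2 * d_hidden, n_actions)), np.zeros(n_actions)

def two_stream_logits(body_feats, scene_feats):
    """Process body and background features in separate streams, then fuse
    them by concatenation for the final action classification."""
    h_body = stream(body_feats, w_body, b_body)
    h_scene = stream(scene_feats, w_scene, b_scene)
    fused = np.concatenate([h_body, h_scene], axis=-1)
    return fused @ w_out + b_out

# One dummy clip's body and background feature vectors.
body = rng.normal(size=(1, d_in))
scene = rng.normal(size=(1, d_in))
print(two_stream_logits(body, scene).shape)  # (1, 500)
```

Because the streams share no parameters, either input can be ablated (e.g. zeroed) while the other still contributes, mirroring the paper's body-only and background-only stimulus conditions.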
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7339