Extending Action Recognition in the Compressed Domain

Samuel Abrams, Vijaykrishnan Narayanan

Published: 2023, Last Modified: 13 Nov 2024, VLSID 2023, CC BY-SA 4.0

Abstract: As the internet continues to extend its reach into every facet of society, video is becoming one of the most common mediums for communicating ideas and information. Working with video in the compressed domain saves resources by avoiding decompression and allows faster processing, because the input streams are smaller and avoid redundancy. Most prior work on compressed video processing has focused on the MPEG-4 Part 2 codec, which is dated in part because of its suboptimal compression ratio. For example, H.264 has a compression ratio 2x greater than that of MPEG-4 Part 2, with improved quality. Given the increasing prevalence of the more effective H.264 codec for video content, designing a network that infers directly on H.264 compressed video is essential. Hence, we propose a new video analytics architecture that uses only two streams of data from the compressed domain, compared to the three or more generally used for compressed recognition. The proposed architecture, coined Extended Codec Recognition Network (ECRN), is, to our knowledge, the first approach to support action recognition on both MPEG-4 Part 2 and H.264 compressed video. It is computationally efficient and achieves accuracy competitive with methods that perform recognition solely on MPEG-4 Part 2 streams. The ability to achieve competitive accuracy using a modern video codec creates the potential to extend compressed action recognition to a wide range of applications.