Jointly Modeling Static Visual Appearance and Temporal Pattern for Unsupervised Video Hashing

Chao Li, Yang Yang, Jiewei Cao, Zi Huang

2017 (modified: 09 Nov 2022) | CIKM 2017 | Readers: Everyone

Abstract: Recently, hashing has proven to be an efficient and effective method for large-scale video retrieval. Most existing hashing methods are based on visual features, which are expected to capture the appearance of videos. The intrinsic temporal pattern embedded in videos has also shown discriminative power for similarity search, and has been explored and utilised in some recent studies. However, how to leverage the strengths of both aspects remains an open question. In this paper, we propose to jointly model static visual appearance and temporal pattern for video hash code generation, as both are believed to carry important information for learning an effective hash function. We design a novel unsupervised video hashing framework accordingly, whose hash function comprises two encoders: a temporal encoder and an appearance encoder. The two encoders are learned by self-supervision and are designed to reconstruct the temporal pattern of videos and the visual appearance of frames, respectively. To learn the two encoders jointly, we impose three learning criteria: minimal binarization loss, balanced hash codes and independent hash codes. Extensive experiments on two large-scale video datasets (i.e. FCVID and ActivityNet) confirm the superior performance of our method compared with state-of-the-art video hashing methods.
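
The abstract names the three learning criteria but does not give their formulas. As a rough illustration only, the sketch below uses formulations that are common in the hashing literature, not necessarily the ones used in this paper: binarization loss as the squared gap between relaxed codes and their signs, balance as the squared mean of each bit, and independence as the squared gap between the bit correlation matrix and the identity. The function name `hashing_criteria` and the equal treatment of the three terms are assumptions made for this example.

```python
def hashing_criteria(h):
    """Illustrative penalties for relaxed hash codes (assumed formulations).

    h is a list of n rows, each holding k real-valued code entries,
    typically in [-1, 1] before thresholding to binary codes.
    Returns (binarization, balance, independence); each is 0 for ideal codes.
    """
    n = len(h)
    k = len(h[0])

    # Binarization loss: mean squared gap between each entry and its sign.
    signs = [[1.0 if v >= 0 else -1.0 for v in row] for row in h]
    binarization = sum(
        (v - s) ** 2
        for row, srow in zip(h, signs)
        for v, s in zip(row, srow)
    ) / (n * k)

    # Balance: each bit should be +1 for about half of the samples,
    # so the per-bit mean over the samples should be close to zero.
    means = [sum(row[j] for row in h) / n for j in range(k)]
    balance = sum(m ** 2 for m in means) / k

    # Independence: different bits should be uncorrelated, so the
    # (k x k) matrix of averaged products should be close to the identity.
    independence = 0.0
    for i in range(k):
        for j in range(k):
            c = sum(row[i] * row[j] for row in h) / n
            target = 1.0 if i == j else 0.0
            independence += (c - target) ** 2
    independence /= k * k

    return binarization, balance, independence
```

For example, the four 2-bit codes `[[1, 1], [1, -1], [-1, 1], [-1, -1]]` are already binary, balanced and uncorrelated, so all three penalties are zero, whereas two identical codes `[[1, 1], [1, 1]]` are binary but neither balanced nor independent.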
