Abstract: The dense motion semantics in talking-head videos can be encoded as sparse keypoints for low-bitrate semantic source coding. Previous methods extract keypoints that describe the semantics of fixed facial parts, typically the eyes, nose, and mouth. However, these fixed keypoints cannot adapt to varied videos, limiting the efficiency of source coding. In this paper, we propose adaptive keypoints that perceptively express facial movements for flexible talking-head semantic source coding. The superiority of adaptive keypoints is theoretically substantiated through likelihood functions. To extract adaptive keypoints, we design a detector with a transformer architecture and regression layers, so that keypoints are sampled from the global context rather than confined to fixed facial regions, breaking the conventional understanding of keypoints. Moreover, the bitrate can be adjusted via the number of keypoints, enabling flexible source coding. Extensive experiments demonstrate that our semantic source coding approach achieves better rate-distortion performance than existing state-of-the-art methods.
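To make the described detector concrete, below is a minimal sketch of one plausible realization: a transformer encoder over patch tokens whose global context is read out by learned per-keypoint queries, followed by regression layers that map each query to a free (x, y) coordinate anywhere in the frame. All names, dimensions, and the ViT-style patch embedding are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical adaptive keypoint detector (assumed design, not the paper's code):
# transformer encoder for global context + regression head per learned query.
import torch
import torch.nn as nn

class AdaptiveKeypointDetector(nn.Module):
    def __init__(self, num_keypoints=10, dim=256, depth=4, heads=8, patch=16):
        super().__init__()
        # Patch embedding: split the frame into non-overlapping patches
        # and project each patch to a token (a ViT-style assumption).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       batch_first=True),
            num_layers=depth,
        )
        # One learned query per keypoint; choosing num_keypoints sets the
        # number of transmitted keypoints and hence the bitrate.
        self.queries = nn.Parameter(torch.randn(num_keypoints, dim))
        # Regression layers: map each attended query to a normalized
        # (x, y) coordinate, unconstrained by fixed facial regions.
        self.regress = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                     nn.Linear(dim, 2), nn.Tanh())

    def forward(self, frames):                      # frames: (B, 3, H, W)
        tokens = self.patch_embed(frames).flatten(2).transpose(1, 2)
        ctx = self.encoder(tokens)                  # global context: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        # Cross-attend queries to the global context (scaled dot-product).
        attn = torch.softmax(q @ ctx.transpose(1, 2) / ctx.size(-1) ** 0.5, -1)
        return self.regress(attn @ ctx)             # (B, K, 2), coords in [-1, 1]
```

Under this reading, the rate-distortion trade-off the abstract describes would correspond to instantiating the detector with a different keypoint count K: fewer queries yield fewer transmitted coordinates (lower bitrate) at the cost of coarser motion semantics.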