Gp3Former: Gaussian Prior Tri-Cascaded Transformer for Video Instance Segmentation in Livestreaming Scenarios

Wensheng Li, Jing Zhang, Li Zhuo

Published: 2025, Last Modified: 06 Nov 2025. IEEE Trans. Emerg. Top. Comput. Intell. 2025. License: CC BY-SA 4.0

Abstract: Livestreaming platforms attract many active streamers and daily users, and their influence on public opinion poses a major challenge to network regulation. Video scene understanding can improve the efficiency and quality of network regulation, and video instance segmentation is a fundamental task within it. To handle the small, dense instances and fast-changing scenes typical of livestreaming, we propose Gp3Former, a Gaussian prior tri-cascaded Transformer for video instance segmentation. First, the Mask2Former-VIS encoder is used to enhance the representation of video features at different scales, which benefits small instance segmentation. Then, a tri-cascaded Transformer decoder is designed to adapt to the fast-changing scenes in livestreaming; it extracts global, balanced, and local instance features while sacrificing as little scene information as possible. Finally, to cope with dense instances, a Gaussian prior is imposed during instance association and segmentation to learn the Gaussian distribution of a series of cross-frame instances. Experimental results show that, at an inference speed of 19.6 FPS, the proposed method reaches 50.6% AP and 50.0% AR on YouTube-VIS 2019, and 82.9% AP and 82.3% AR on the self-built BJUT-LSD dataset, demonstrating its effectiveness and superiority for video instance segmentation in livestreaming scenarios.
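
The abstract does not specify how the Gaussian prior enters instance association, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes each instance is summarized by a 2D center, that an appearance similarity matrix is already available, and that the prior is an isotropic Gaussian on the center displacement between consecutive frames. The names `gaussian_prior`, `associate`, and the parameter `sigma` are hypothetical.

```python
import math


def gaussian_prior(prev_center, curr_center, sigma):
    """Unnormalized isotropic Gaussian weight on the displacement
    between two instance centers (assumed form, not from the paper)."""
    dx = curr_center[0] - prev_center[0]
    dy = curr_center[1] - prev_center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def associate(prev_centers, curr_centers, similarity, sigma=20.0):
    """For each current-frame instance j, return the index of the
    previous-frame instance i that maximizes
    similarity[i][j] * gaussian_prior(prev_centers[i], curr_centers[j]).

    This is a greedy per-instance choice; it does not enforce a
    one-to-one matching.
    """
    matches = []
    for j, curr in enumerate(curr_centers):
        scores = [
            similarity[i][j] * gaussian_prior(prev, curr, sigma)
            for i, prev in enumerate(prev_centers)
        ]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches


# Two well-separated instances whose appearance scores are ambiguous:
# appearance alone would swap the identities, the spatial prior does not.
prev_centers = [(10.0, 10.0), (200.0, 200.0)]
curr_centers = [(12.0, 11.0), (198.0, 203.0)]
similarity = [
    [0.60, 0.70],  # similarity[i][j]: previous instance i vs. current instance j
    [0.65, 0.60],
]
matches = associate(prev_centers, curr_centers, similarity)
```

In this toy case the raw similarity scores favor the wrong pairing for both instances, while the displacement-weighted scores keep each identity attached to its nearby predecessor. With many dense instances, a one-to-one assignment step would normally replace the greedy choice used here.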