Abstract: Models optimized for accuracy on single images are often prohibitively slow to
run on each frame in a video, especially on challenging dense prediction tasks,
such as semantic segmentation. Recent work exploits optical flow to
warp image features forward from select keyframes, as a means to conserve computation
on video. Even when optimized, however, this approach achieves only limited speedup,
due both to the accuracy degradation introduced by repeated forward
warping and to the inference cost of optical flow estimation. To address these problems,
we propose a new scheme that propagates features using the block motion
vectors (BMV) present in compressed video (e.g., H.264), instead of optical
flow, and bi-directionally warps and fuses features from enclosing keyframes
to capture scene context on each video frame. Our technique, interpolation-BMV,
enables us to accurately estimate the features of intermediate frames, while keeping
inference costs low. We evaluate our system on the CamVid and Cityscapes
datasets, comparing to both a strong single-frame baseline and related work. We
find that we are able to substantially accelerate segmentation on video, achieving
near real-time frame rates (20+ frames per second) on large images (e.g. 960 × 720
pixels), while maintaining competitive accuracy. This represents an improvement
of almost 6x over the single-frame baseline and 2.5x over the fastest prior work.
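To make the propagation step concrete, the sketch below illustrates bi-directional feature warping and fusion driven by block motion vectors. It is a minimal illustration under our own assumptions, not the paper's implementation: the function names (`warp`, `interpolate_features`), the premise that BMVs have already been upsampled to dense per-pixel motion fields, and the simple distance-weighted average fusion are all illustrative choices.

```python
# Minimal sketch of BMV-driven bi-directional feature interpolation.
# Assumptions (not from the paper's code): feat_prev/feat_next are (1, C, H, W)
# feature maps from the enclosing keyframes; bmv_fwd/bmv_bwd are (H, W, 2)
# motion fields in pixels, upsampled from the codec's block motion vectors.
import torch
import torch.nn.functional as F

def warp(feat, motion):
    """Backward-warp a feature map by a dense motion field (in pixels)."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Sample each output pixel from its motion-compensated source location,
    # with coordinates normalized to [-1, 1] as grid_sample expects.
    grid_x = (xs + motion[..., 0]) / (w - 1) * 2 - 1
    grid_y = (ys + motion[..., 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0).float()
    return F.grid_sample(feat, grid, align_corners=True)

def interpolate_features(feat_prev, feat_next, bmv_fwd, bmv_bwd, alpha):
    """Fuse features warped from both keyframes; alpha in [0, 1] is the
    intermediate frame's relative temporal position between them."""
    warped_prev = warp(feat_prev, bmv_fwd)  # propagate previous keyframe forward
    warped_next = warp(feat_next, bmv_bwd)  # propagate next keyframe backward
    # Distance-weighted fusion: the temporally nearer keyframe contributes more.
    return (1 - alpha) * warped_prev + alpha * warped_next
```

Because the motion fields come for free from the compressed bitstream, the per-frame cost of this step is just two warps and a weighted sum, which is what lets the scheme avoid running either the full segmentation network or an optical flow network on intermediate frames.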
Keywords: semantic segmentation, video, efficient inference, video segmentation, video compression
TL;DR: We exploit video compression techniques (in particular, the block motion vectors in H.264 video) and feature similarity across frames to accelerate a classical image recognition task, semantic segmentation, on video.
Data: [CamVid](https://paperswithcode.com/dataset/camvid), [Cityscapes](https://paperswithcode.com/dataset/cityscapes)