Keywords: Scene flow estimation, Self-supervised learning, Point clouds, Machine vision
Abstract: Self-supervised feed-forward methods for scene flow estimation offer real-time efficiency, but their supervision from two-frame point correspondences is unreliable and often breaks down under occlusions. Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames, yet naive extensions of two-frame objectives are ineffective because point correspondences vary abruptly across frames, producing inconsistent signals.
In this paper, we present TeFlow, which enables multi-frame supervision for feed-forward models by mining temporally consistent supervisory signals.
TeFlow introduces a temporal ensembling strategy that forms reliable supervisory signals by aggregating the most temporally consistent motion cues from a candidate pool built across multiple frames.
Extensive evaluations demonstrate that TeFlow establishes a new state of the art for self-supervised feed-forward methods, achieving gains of **up to 33\%** on the challenging Argoverse 2 and nuScenes datasets. Our method performs on par with leading optimization-based methods while running **150** times faster. The source code and model weights will be released upon publication.
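The abstract describes temporal ensembling only at a high level; below is a minimal sketch of one plausible reading of it, assuming per-point flow candidates have already been gathered across K frame pairs. All names here (`candidate_flows`, `ensemble_pseudo_labels`, the top-k selection rule) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: aggregate the most temporally consistent flow
# candidates into a pseudo-label (not the paper's official code).
import torch

def ensemble_pseudo_labels(candidate_flows: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Aggregate the k most temporally consistent candidates per point.

    candidate_flows: (K, N, 3) flow candidates for N points from K frame pairs.
    Returns: (N, 3) pseudo-labels usable as self-supervision targets.
    """
    mean_flow = candidate_flows.mean(dim=0, keepdim=True)        # (1, N, 3)
    deviation = (candidate_flows - mean_flow).norm(dim=-1)       # (K, N)
    # Keep the k candidates closest to the ensemble mean,
    # i.e. the most temporally consistent motion cues per point.
    idx = deviation.topk(k, dim=0, largest=False).indices        # (k, N)
    selected = torch.gather(
        candidate_flows, 0, idx.unsqueeze(-1).expand(-1, -1, 3)  # (k, N, 3)
    )
    return selected.mean(dim=0)                                  # (N, 3)

# Example: 4 candidate flow fields for 1024 points.
labels = ensemble_pseudo_labels(torch.randn(4, 1024, 3), k=2)
```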
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6220