Abstract: We consider the problem of reconstructing a dynamic scene observed from a stereo camera. Most existing methods for depth from stereo treat different stereo frames independently, leading to temporally inconsistent depth predictions. Temporal consistency is especially important for immersive AR or VR scenarios, where flickering greatly diminishes the user experience. We propose DynamicStereo, a novel transformer-based architecture to estimate disparity for stereo videos. The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions. Our architecture is designed to process stereo videos efficiently through divided attention layers. We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments, which provides complementary training and evaluation data for dynamic stereo that is closer to real applications than existing datasets. Training with this dataset further improves the prediction quality of both DynamicStereo and prior methods. Finally, it serves as a benchmark for temporally consistent stereo methods. Project page: https://dynamic-stereo.github.io/
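To make the "divided attention" idea concrete: the abstract does not spell out the layer's implementation, but divided attention commonly means factorizing joint space-time attention into a spatial pass (tokens attend within a frame) followed by a temporal pass (each spatial location attends across frames), which is how the network can pool information from neighboring frames at manageable cost. The sketch below, in PyTorch, illustrates this pattern; the class name `DividedAttention` and all shapes are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of a divided (factorized space/time) attention layer,
# assuming PyTorch. Names and shapes are illustrative, not the paper's code.
import torch
import torch.nn as nn


class DividedAttention(nn.Module):
    """Self-attention applied separately along the spatial and temporal
    axes of a (batch, time, tokens, channels) feature volume; much cheaper
    than joint space-time attention over all frames at once."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, c = x.shape  # batch, frames, spatial tokens, channels

        # Spatial pass: fold time into the batch so each frame's tokens
        # only attend to tokens of the same frame.
        xs = x.reshape(b * t, n, c)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]

        # Temporal pass: fold space into the batch so each spatial location
        # attends across frames, pooling temporal context for consistency.
        xt = xs.reshape(b, t, n, c).permute(0, 2, 1, 3).reshape(b * n, t, c)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]

        return xt.reshape(b, n, t, c).permute(0, 2, 1, 3)


# Example: 2 clips of 5 frames, 16x16 = 256 tokens, 128 channels.
feats = torch.randn(2, 5, 256, 128)
out = DividedAttention(dim=128)(feats)
print(out.shape)  # torch.Size([2, 5, 256, 128])
```

The design choice is the usual one: joint attention over T frames of N tokens costs O((TN)^2), while the factorized form costs O(T·N^2 + N·T^2), which is what makes processing whole stereo clips tractable.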