Abstract: Deep learning-based multi-view stereo approaches have driven significant progress in video depth estimation. However, existing methods struggle to produce consistently accurate depth maps that account for both multi-view geometry and temporal consistency in monocular video. To overcome this limitation, we introduce CMVDE, a video depth estimation framework that couples multi-view geometric and temporal cues in an end-to-end manner. Our geometric consistency module efficiently generates multi-view geometric features by applying mutual cross-view epipolar attention between adjacent video frames, and compresses these features with a novel multi-scale feature compressor to produce a compact input tensor for the subsequent module. Our temporal consistency module, built on convolutional LSTM [1], then enforces consistency across consecutive frames by leveraging previous depth predictions as geometric guidance. On the ScanNet [2] and 7-Scenes [3] datasets, our approach surpasses previous multi-view video depth estimation methods in both depth accuracy and temporal consistency.
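The cross-view attention underlying the geometric consistency module can be illustrated with a minimal sketch. The code below is a hypothetical simplification, not the authors' implementation: it computes full all-pairs scaled dot-product cross-attention between flattened feature maps of two frames, whereas an epipolar variant would restrict each reference pixel's attention to the corresponding epipolar line in the source view. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def cross_view_attention(feat_ref, feat_src):
    """Scaled dot-product cross-attention between two views' features.

    feat_ref, feat_src: (N, C) arrays of N flattened pixel features
    with C channels each. Simplification: attends over ALL source
    pixels; an epipolar version would mask to the epipolar line.
    """
    c = feat_ref.shape[1]
    logits = feat_ref @ feat_src.T / np.sqrt(c)      # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over source pixels
    return weights @ feat_src                        # (N, C) aggregated features
```

A "mutual" variant, as the abstract describes, would apply this in both directions (reference-to-source and source-to-reference) so each frame's features are enriched with geometry-aware context from its neighbor.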