SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model

Published: 20 Jul 2024 · Last Modified: 06 Aug 2024 · MM2024 Poster · CC BY 4.0
Abstract: In the last year, universal monocular metric depth estimation (universal MMDE) has gained considerable attention, serving as a foundation model for various multimedia tasks, such as video and image editing. Nonetheless, current approaches struggle to maintain consistent accuracy across diverse scenes without scene-specific parameters and pre-training, hindering the practicality of MMDE. Furthermore, these methods rely on extensive datasets comprising millions, if not tens of millions, of training samples, leading to significant time and hardware expenses. This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes, without needing extensive training data or GPU clusters. First, to obtain consistent depth across diverse scenes, we propose a novel metric scale modeling, i.e., variation-based unnormalized depth bins. It reduces the ambiguity of conventional metric bins and enables better adaptation to large depth gaps between scenes during training. Second, we propose a "divide and conquer" solution to reduce reliance on massive training data: instead of estimating directly from the vast solution space, the metric bins are estimated from multiple solution sub-spaces to reduce complexity. Additionally, we introduce an uncut depth dataset, Campus Depth, to evaluate depth accuracy and consistency across various indoor and outdoor scenes. Trained on a consumer-grade GPU using just 150K RGB-D pairs, SM4Depth achieves outstanding performance on most never-before-seen datasets, and in particular maintains consistent accuracy across indoor and outdoor scenes. The code can be found in the supplementary material.
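To make the bins-based formulation referenced in the abstract concrete, below is a minimal, hypothetical sketch of a metric-bins depth head with a coarse "divide and conquer" step that first selects a depth-range sub-space and then predicts bins within it. It follows the generic adaptive-bins recipe (per-pixel softmax over bin centers, expected value as metric depth); all module names, depth ranges, and sizes are illustrative assumptions and do not reproduce SM4Depth's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinsDepthHead(nn.Module):
    """Hypothetical bins-based metric depth head (illustrative only).

    A global classifier picks an assumed depth-range sub-space (e.g. indoor vs.
    outdoor ranges), a per-sub-space regressor predicts bin widths inside that
    range, and per-pixel probabilities over the bin centers yield metric depth.
    """

    def __init__(self, feat_dim=256, n_bins=64,
                 sub_ranges=((1e-3, 10.0), (1e-3, 80.0), (1e-3, 250.0))):
        super().__init__()
        self.n_bins = n_bins
        self.sub_ranges = sub_ranges
        # Classifier over the assumed depth-range sub-spaces.
        self.range_cls = nn.Linear(feat_dim, len(sub_ranges))
        # One bin-width regressor per sub-space.
        self.bin_heads = nn.ModuleList(
            nn.Linear(feat_dim, n_bins) for _ in sub_ranges)
        # Per-pixel logits over bins.
        self.pixel_logits = nn.Conv2d(feat_dim, n_bins, kernel_size=1)

    def forward(self, feats):
        # feats: (B, C, H, W) decoder features.
        pooled = feats.mean(dim=(2, 3))                    # (B, C)
        range_idx = self.range_cls(pooled).argmax(dim=1)   # (B,)

        depths = []
        for b in range(feats.size(0)):
            idx = int(range_idx[b])
            d_min, d_max = self.sub_ranges[idx]
            # Normalized bin widths -> cumulative bin edges in [d_min, d_max].
            widths = F.softmax(self.bin_heads[idx](pooled[b:b + 1]), dim=1)
            edges = d_min + (d_max - d_min) * torch.cumsum(widths, dim=1)
            left = F.pad(edges, (1, 0), value=d_min)[:, :-1]
            centers = 0.5 * (left + edges)                  # (1, n_bins)

            # Per-pixel probability over bins; expected value gives metric depth.
            probs = F.softmax(self.pixel_logits(feats[b:b + 1]), dim=1)
            depth = (probs * centers.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
            depths.append(depth)
        return torch.cat(depths, dim=0)                     # (B, 1, H, W)


if __name__ == "__main__":
    head = BinsDepthHead()
    x = torch.randn(2, 256, 60, 80)   # dummy decoder features
    print(head(x).shape)              # torch.Size([2, 1, 60, 80])
```

The sketch only illustrates why restricting bin prediction to a selected sub-space shrinks the solution space each regressor must cover; the paper's actual variation-based unnormalized bins and branch design should be taken from the publication and supplementary code.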
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Systems] Systems and Middleware
Relevance To Conference:
1. This paper primarily investigates consistently estimating high-accuracy depth from a single image or video across diverse scenes.
2. Relevance to multimodality: MDE translates RGB into depth, exploring the association between two modalities captured by different sensors (RGB camera and depth camera/LiDAR).
3. Relevance to multimedia: Our method can be widely used in multimedia processing technologies, such as video editing, image editing, image generation, 3D reconstruction, augmented reality, and even multimedia mobile apps such as TikTok.
4. Contributions to multimedia or multimodal processing: Images in multimedia are collected from diverse scenes, and it is challenging to consistently estimate high-accuracy depth on such data. This paper requires neither scene-specific parameters nor pre-training, yet ensures consistent and continuous depth across scenes, thereby improving the usability of higher-level multimedia tasks. Furthermore, our MDE method employs only 150K RGB-D training pairs and a consumer-grade GPU (RTX 3090), which is more cost-effective than the 300M data (8*A100), 800M data (48*A100), and 6100 data required by previous methods. Training and deployment costs are therefore greatly reduced, making MDE easier to apply in multimedia tasks.
5. This paper reviews and references over 15 multimedia papers published in ACM MM, ICME, TMM, and TOMM in recent years.
Supplementary Material: zip
Submission Number: 3797