M2H-MX: Multi-Task Semantic and Geometric Perception for Real-Time Monocular 3D Scene Graph Construction

Published: 27 May 2026, Last Modified: 27 May 2026
Venue: ICRA 2026 SRRA Workshop LightningTalkPoster
License: CC BY 4.0
Keywords: 3D scene graphs, dense multi-task learning, robot perception, real-time scene understanding
TL;DR: M2H-MX enables real-time 3D scene graph construction from a single RGB camera, delivering stable metric-semantic maps without RGB-D or LiDAR at 25 FPS on a Jetson Orin NX.
Abstract: Reliable robot autonomy requires spatial representations that ground semantic understanding in metric geometry that remains stable under motion. 3D scene graphs offer such a structure, but existing systems typically rely on RGB-D or LiDAR sensing to obtain reliable geometry, limiting deployment on compact robotic platforms. This paper presents M2H-MX, a monocular dense perception front end that predicts metric depth and semantic labels from RGB frames. These predictions are integrated with IMU-assisted odometry in a real-time metric-semantic mapping pipeline for downstream 3D scene graph construction. M2H-MX combines DINOv3 feature adaptation, register-guided multi-scale decoding, and directed cross-task refinement so that global scene context, local boundaries, and geometric cues are fused before map integration. Unlike dense prediction studies that stop at per-frame metrics, we evaluate M2H-MX both as a benchmark predictor and as a deployed mapping component. On NYUDv2, M2H-MX improves semantic mIoU by 4.06 points and reduces depth RMSE by 9.4% over the strongest multi-task baseline considered. In ScanNet deployment, the M2H-MX Mono-Hydra stack reduces average ATE from 17.59 cm to 6.91 cm compared with monocular GO-SLAM, while sustaining a 25–30 Hz asynchronous perception-to-mapping loop at 640 × 480 input resolution on an RTX 4080 Super. A TensorRT FP16 deployment reaches 25 FPS on a Jetson Orin NX 16 GB at 192×256 input resolution. These results indicate that improving the monocular perception front end can strengthen the metric-semantic representation required for real-time 3D scene graph construction.
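The abstract reports depth RMSE, semantic mIoU, and a drop in average ATE. As a minimal sketch, these quantities can be computed from their standard textbook definitions as shown below. The function names, the valid-pixel mask (ground-truth depth greater than zero), and the choice to average IoU only over classes that appear in the prediction or the ground truth are assumptions made for illustration; the abstract does not state the evaluation protocol the authors actually used.

```python
import numpy as np


def depth_rmse(pred, gt):
    """Root-mean-square depth error over pixels with valid ground truth.

    Assumption: a ground-truth depth of 0 marks a missing measurement.
    """
    valid = gt > 0
    return float(np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2)))


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes.

    Assumption: classes absent from both prediction and ground truth
    are skipped rather than counted as IoU 0 or 1.
    """
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))


def relative_reduction(before, after):
    """Fractional reduction of an error measure, e.g. 0.25 for a 25% drop."""
    return (before - after) / before


# Toy 2x2 inputs, hypothetical values chosen only to exercise the functions.
toy_rmse = depth_rmse(np.array([[1.0, 2.0], [3.0, 4.0]]),
                      np.array([[1.0, 2.0], [0.0, 6.0]]))
toy_miou = mean_iou(np.array([[0, 1], [1, 1]]),
                    np.array([[0, 1], [0, 1]]), num_classes=2)

# ATE figures quoted in the abstract (cm): GO-SLAM vs. the M2H-MX stack.
ate_drop = relative_reduction(17.59, 6.91)
```

Applied to the ATE figures quoted in the abstract, `relative_reduction(17.59, 6.91)` comes out to roughly 0.61, that is, about a 61% lower average trajectory error than monocular GO-SLAM.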
Submission Number: 42