Abstract: Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, even though radar is known to be more robust than cameras in adverse weather. Additionally, while vision-language models have advanced rapidly, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy, together with feature extraction and fusion techniques, that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE.
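To illustrate the idea of weather-adaptive radar weighting described in the abstract, the sketch below shows one plausible way such a fusion block could look in PyTorch. This is a hypothetical, minimal example, not the authors' TRIDE implementation: the module name `WeatherAwareFusion`, the channel sizes, and the use of a sigmoid gate driven by a weather embedding are all assumptions made for illustration.

```python
# Hypothetical sketch (not the authors' code): a fusion block that predicts a
# channel-wise radar weight from a weather embedding, so radar features can be
# up-weighted in adverse weather and down-weighted otherwise.
import torch
import torch.nn as nn


class WeatherAwareFusion(nn.Module):
    def __init__(self, cam_ch: int, rad_ch: int, weather_dim: int = 16):
        super().__init__()
        # Maps a weather descriptor (e.g., an embedding of "rain" / "clear")
        # to a per-channel gate for the radar branch.
        self.gate = nn.Sequential(
            nn.Linear(weather_dim, rad_ch),
            nn.Sigmoid(),
        )
        # Fuses the camera features with the re-weighted radar features.
        self.fuse = nn.Conv2d(cam_ch + rad_ch, cam_ch, kernel_size=1)

    def forward(self, cam_feat, rad_feat, weather_emb):
        # cam_feat: (B, cam_ch, H, W); rad_feat: (B, rad_ch, H, W)
        # weather_emb: (B, weather_dim)
        w = self.gate(weather_emb)                      # (B, rad_ch)
        rad_weighted = rad_feat * w[:, :, None, None]   # broadcast over H, W
        return self.fuse(torch.cat([cam_feat, rad_weighted], dim=1))
```

In this sketch the weather embedding could come from any source (e.g., a text encoder applied to a generated weather description); the gate simply scales how much the radar branch contributes before fusion.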
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The requested changes are summarized below.
Main paper:
- Updated Figure 1 for better alignment with Figure 2.
- Revised Section 3.4 by adding more formulas for improved readability.
- Included the previously missing Table 2 (now located on page 10).
- Modified Table 7 with additional experiments.
- Added Section 4.5, comparing qualitative results under different weather conditions (moved from the Appendix to the main paper).
Appendix:
- Added the loss convergence curves and hyperparameter details (A.4).
- Added failure-case analysis (A.6).
- Added more experiments on integrating the text branch into an existing camera–radar depth estimation method (A.7).
- Added runtime analysis (A.8).
Assigned Action Editor: ~Adam_W_Harley1
Submission Number: 4834