Abstract: Monocular depth estimation (MDE) is one of the most difficult tasks in computer vision. The problem becomes even more complicated in the case of aerial images due to the high complexity and lack of structure present in such scenarios. State-of-the-art MDE methods cope with such environments only by using large amounts of computational resources. In this work, we provide a more resource-aware alternative by dynamically injecting scene priors into the network through semantic features. To this end, we propose a novel dynamic semantic-aware module that combines features extracted from the RGB image with a dynamically weighted semantic map. The weights are gradually adjusted according to the iteration number during training. Following this methodology, the network initially predicts depth for larger objects in the scene; as training progresses, smaller objects are increasingly emphasized, improving confidence at object boundaries. We show that this dynamic adaptation mechanism leads to better and faster convergence. The adaptation technique is applied to various learning frameworks by introducing three novel semantic-aware losses for depth-based regression, classification, and ordinal regression. We report results on a large set of synthetic and real-life aerial images captured in various scenarios, demonstrating the effectiveness of our approach in terms of convergence time, depth accuracy, and prediction time.
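The dynamic weighting idea described above can be illustrated with a minimal sketch. The schedule below is an assumption for illustration only (the paper does not specify its exact form): per-class weights start from a size-based prior that favors large objects and are linearly interpolated toward uniform weights as training progresses, and the weighted one-hot semantic map is simply concatenated with the RGB features. The function names `semantic_weight` and `fuse_features` are hypothetical.

```python
import numpy as np

def semantic_weight(iteration, total_iters, class_sizes):
    """Hypothetical schedule: early iterations weight larger semantic
    classes more heavily; the weights flatten toward uniform as training
    progresses, so smaller objects gain emphasis later."""
    progress = min(iteration / total_iters, 1.0)          # 0 -> 1
    size_prior = class_sizes / class_sizes.sum()          # favors large classes
    uniform = np.full_like(size_prior, 1.0 / len(size_prior))
    # linear interpolation from size-based prior to uniform weights
    return (1.0 - progress) * size_prior + progress * uniform

def fuse_features(rgb_feat, semantic_map, weights):
    """Fuse RGB features (H, W, F) with a class-weighted one-hot encoding
    of an integer semantic map (H, W) via channel concatenation."""
    onehot = np.eye(len(weights))[semantic_map]           # (H, W, C)
    weighted = onehot * weights                           # scale each class channel
    return np.concatenate([rgb_feat, weighted], axis=-1)  # (H, W, F + C)
```

Early in training the large-class channels dominate the fused representation; by the final iteration all semantic channels contribute equally, which matches the coarse-to-fine behavior the abstract describes.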