Abstract: In this paper, we tackle the problem of deep end-to-end multi-task learning (MTL) for visual scene understanding from monocular images. It has been shown that learning several related tasks jointly yields better per-task performance than training them independently. This is because related tasks share important feature characteristics, which MTL techniques can effectively exploit for improved joint training. Based on this premise, we are interested in extracting generic-to-specific features for the different tasks within a common framework. To this end, we propose a U-Net style encoder-decoder architecture called AdaMT-Net, in which a densely-connected deep convolutional neural network (CNN) feature encoder is shared among the tasks while soft-attention based task-specific decoder modules produce the desired outputs. One major issue in MTL is assigning the weights of the task-specific loss terms in the final cumulative optimization objective. As opposed to manual tuning, we propose a novel adaptive weight learning strategy that carefully examines the per-task loss gradients over the training iterations. Experimental results on the benchmark Cityscapes, NYUv2, and ISPRS datasets confirm that AdaMT-Net achieves state-of-the-art performance on most of the evaluation metrics.
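The abstract does not spell out the exact weighting rule, so the following is only a minimal sketch, assuming a gradient-magnitude-based scheme in which each task weight is set inversely proportional to a running estimate of that task's loss-gradient norm on the shared encoder (the class name `AdaptiveTaskWeights` and the `shared_params` argument are illustrative, not from the paper).

```python
import torch


class AdaptiveTaskWeights:
    """Hypothetical adaptive task weighting from per-task loss-gradient norms."""

    def __init__(self, num_tasks, momentum=0.9):
        self.num_tasks = num_tasks
        self.momentum = momentum
        # Running per-task gradient-norm estimates, initialised uniformly.
        self.grad_norms = torch.ones(num_tasks)

    def update(self, task_losses, shared_params):
        # Measure each task's gradient magnitude on the shared encoder parameters.
        norms = []
        for loss in task_losses:
            grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
            norms.append(torch.sqrt(sum(g.pow(2).sum() for g in grads)))
        norms = torch.stack(norms).detach()
        # Exponential moving average over iterations for stability.
        self.grad_norms = self.momentum * self.grad_norms + (1 - self.momentum) * norms

    def weights(self):
        # Inverse-gradient-norm weighting, normalised to sum to the task count,
        # so that tasks with large gradients do not dominate the joint loss.
        inv = 1.0 / (self.grad_norms + 1e-8)
        return self.num_tasks * inv / inv.sum()
```

In a training loop, one would call `update` with the list of per-task losses and the shared encoder parameters each iteration, then form the total loss as the weighted sum `sum(w * l for w, l in zip(weighter.weights(), task_losses))` before the backward pass; the actual AdaMT-Net weighting rule may differ in its details.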