Abstract: Autonomous driving has various visual perception tasks such as object detection,
motion detection, depth estimation and flow estimation. Multi-task learning (MTL)
has been successfully used for jointly estimating some of these tasks. Previous
work has focused on utilizing appearance cues. In this paper, we address the gap
of incorporating motion cues in a multi-task learning system. We propose a novel
two-stream architecture for joint learning of object detection, road segmentation
and motion segmentation. We designed three different versions of our network to
establish a systematic comparison. We show that joint training of the tasks significantly
improves accuracy compared to training them independently, even with a
relatively small number of annotated samples for motion segmentation. To enable
joint training, we extended the KITTI object detection dataset to include moving/static
annotations of the vehicles. An extension of this new dataset, named KITTI MOD,
is made publicly available via the official KITTI benchmark website. Our baseline
network outperforms MPNet, which is a state-of-the-art method for single-stream CNN-based
motion detection. The proposed two-stream architecture improves the mAP score
by 21.5% on KITTI MOD. We also evaluated our algorithm on the non-automotive
DAVIS dataset and obtained accuracy close to the state-of-the-art performance.
The proposed network runs at 8 fps on a Titan X GPU using a two-stream VGG16
encoder. Demonstration of the work is provided in.
Keywords: multitask learning, autonomous driving, motion segmentation