Large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression
Abstract: In this work we present a novel approach to joint semantic localisation and scene understanding. Our work is motivated by the need for localisation algorithms which not only predict 6-DoF camera pose but also simultaneously recognise surrounding objects and estimate 3D geometry. Such capabilities are crucial for computer-vision-guided systems which interact with the environment: autonomous driving, augmented reality and robotics. In particular, we propose a two-step procedure. During the first step we train a convolutional neural network to jointly predict per-pixel globally unique instance labels [7] and the corresponding local coordinates for each instance of a static object (e.g. a building). During the second step we obtain scene coordinates [32] by combining object center coordinates with local coordinates, and use them to perform 6-DoF camera pose estimation. We evaluate our approach on real-world (CamVid-360) and synthetic (SceneCity) autonomous driving datasets [7]. We obtain smaller mean distance and angular errors than state-of-the-art 6-DoF pose estimation algorithms based on direct pose regression [14, 15] and on pose estimation from scene coordinates [3] on all datasets. Our contributions are: (i) a novel formulation of scene coordinate regression as two separate tasks, object instance recognition and local coordinate regression, together with a demonstration that our proposed solution allows us to predict accurate 3D geometry of static objects and to estimate the 6-DoF camera pose on (ii) maps several orders of magnitude larger than previously attempted by scene coordinate regression methods [3, 4, 20, 32], as well as on (iii) lightweight, approximate 3D maps built from 3D primitives such as building-aligned cuboids.
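As a rough illustration of the second step, the sketch below composes scene coordinates from per-pixel instance labels and local coordinates, then recovers the camera pose with OpenCV's PnP + RANSAC solver. All array names, the intrinsics K, and the random placeholder predictions are assumptions made for this example; they are not the authors' code, model outputs, or data.

```python
import numpy as np
import cv2

# Placeholder network outputs for an H x W image (assumptions for
# illustration only, not the authors' model or data):
H, W = 480, 640
num_instances = 10
instance_ids = np.random.randint(0, num_instances, size=(H, W))  # globally unique instance label per pixel
local_coords = np.random.randn(H, W, 3)                          # regressed local 3D coordinates per pixel
object_centers = np.random.randn(num_instances, 3)               # 3D map position of each instance center

# Assumed pinhole camera intrinsics.
K = np.array([[500.0,   0.0, W / 2.0],
              [  0.0, 500.0, H / 2.0],
              [  0.0,   0.0,     1.0]])

# Step 2: scene coordinate = object center + local coordinate, giving
# 2D-3D correspondences for 6-DoF pose estimation via PnP + RANSAC.
stride = 8                                       # subsample pixels to keep RANSAC cheap
vs, us = np.mgrid[0:H:stride, 0:W:stride]
ids = instance_ids[vs, us].ravel()
scene_coords = object_centers[ids] + local_coords[vs, us].reshape(-1, 3)
pixel_coords = np.stack([us.ravel(), vs.ravel()], axis=-1).astype(np.float64)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    scene_coords.astype(np.float64), pixel_coords, K, None,
    reprojectionError=2.0)
if ok:
    R, _ = cv2.Rodrigues(rvec)                   # world-to-camera rotation; pose is [R | tvec]
```

In practice the placeholder arrays would be replaced by the network's actual per-pixel predictions and the map's stored instance centers; the random data here only demonstrates the shapes and the composition step.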