Large scale joint semantic re-localisation and scene understanding via globally unique instance coordinate regression
Abstract: In this work we present a novel approach to joint semantic localisation and scene understanding. Our work is motivated by the need for localisation algorithms which not only predict 6-DoF camera pose but also simultaneously recognise surrounding objects and estimate 3D geometry. Such capabilities are crucial for computer-vision-guided systems which interact with the environment: autonomous driving, augmented reality and robotics. In particular, we propose a two-step procedure. During the first step we train a convolutional neural network to jointly predict per-pixel globally unique instance labels [7] and the corresponding local coordinates for each instance of a static object (e.g. a building). During the second step we obtain scene coordinates [32] by combining object center coordinates with local coordinates, and use them to perform 6-DoF camera pose estimation. We evaluate our approach on real-world (CamVid-360) and synthetic (SceneCity) autonomous driving datasets [7]. We obtain smaller mean distance and angular errors than state-of-the-art 6-DoF pose estimation algorithms based on direct pose regression [14, 15] and on pose estimation from scene coordinates [3] on all datasets. Our contributions are: (i) a novel formulation of scene coordinate regression as two separate tasks, object instance recognition and local coordinate regression, together with a demonstration that our proposed solution allows us to predict accurate 3D geometry of static objects and to estimate the 6-DoF camera pose on (ii) maps several orders of magnitude larger than previously attempted by scene coordinate regression methods [3, 4, 20, 32], as well as on (iii) lightweight, approximate 3D maps built from 3D primitives such as building-aligned cuboids.
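As a rough illustration of the second step, the sketch below composes scene coordinates from per-pixel instance labels and local coordinates, then recovers the camera pose with OpenCV's PnP + RANSAC solver. All array names, the intrinsics K, and the random placeholder predictions are assumptions made for this example; they are not the authors' code, model outputs, or data.

```python
import numpy as np
import cv2

# Placeholder network outputs for an H x W image (assumptions for
# illustration only, not the authors' model or data):
H, W = 480, 640
num_instances = 10
instance_ids = np.random.randint(0, num_instances, size=(H, W))  # globally unique instance label per pixel
local_coords = np.random.randn(H, W, 3)                          # regressed local 3D coordinates per pixel
object_centers = np.random.randn(num_instances, 3)               # 3D map position of each instance center

# Assumed pinhole camera intrinsics.
K = np.array([[500.0,   0.0, W / 2.0],
              [  0.0, 500.0, H / 2.0],
              [  0.0,   0.0,     1.0]])

# Step 2: scene coordinate = object center + local coordinate, giving
# 2D-3D correspondences for 6-DoF pose estimation via PnP + RANSAC.
stride = 8                                       # subsample pixels to keep RANSAC cheap
vs, us = np.mgrid[0:H:stride, 0:W:stride]
ids = instance_ids[vs, us].ravel()
scene_coords = object_centers[ids] + local_coords[vs, us].reshape(-1, 3)
pixel_coords = np.stack([us.ravel(), vs.ravel()], axis=-1).astype(np.float64)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    scene_coords.astype(np.float64), pixel_coords, K, None,
    reprojectionError=2.0)
if ok:
    R, _ = cv2.Rodrigues(rvec)                   # world-to-camera rotation; pose is [R | tvec]
```

In practice the placeholder arrays would be replaced by the network's actual per-pixel predictions and the map's stored instance centers; the random data here only demonstrates the shapes and the composition step.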