Abstract: Recent approaches to visual scene understanding attempt to build a scene graph: a computational representation of objects and their pairwise relationships. Such a rich semantic representation is very appealing, yet difficult to obtain from a single image, especially when considering complex spatial arrangements in the scene. In contrast, an image sequence conveys useful information through the multi-view geometric relations that arise from camera motion. Indeed, object relationships are naturally related to the 3D scene structure. To this end, this paper proposes a system that first computes the geometric location of objects in a generic scene and then efficiently constructs scene graphs from video by embedding such geometric reasoning. This representation is obtained using a new model in which geometric and visual features are merged within an RNN framework. We report results on a dataset we created for the task of 3D scene graph generation from multiple views.