Convolutional Transformer Network: Future Pedestrian Location in First-Person Videos Using Depth Map and 3D Pose
Abstract: Future pedestrian trajectory prediction in first-person (egocentric) videos offers great prospects for autonomous vehicles and social robots. Given a first-person video stream, we aim to predict a pedestrian's location and depth (the distance between the observed person and the camera) in future frames. To locate the person's future trajectory, we consider three key factors: a) Each image in the video sequence is a projection of the actual 3D scene onto a 2D plane; we restore the spatial distribution of pedestrians from the two-dimensional images to three-dimensional space, where the pedestrian's distance from the camera, represented by the image depth, is the third dimension of information that is otherwise lost. b) First-person videos allow people's 3D poses to represent intention interactions among people. c) The patterns in a pedestrian's historical trajectory are crucial for predicting the pedestrian's future trajectory. We encode these three factors into a multi-channel tensor that represents the layout of the scene in three-dimensional space, and feed this tensor into an end-to-end fully convolutional framework based on the transformer architecture. Experimental results on the public MOT16 benchmark demonstrate the effectiveness of our method.
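As a rough illustration of the multi-channel tensor described above, the sketch below stacks a depth map, per-joint 3D pose heatmaps, and historical trajectory heatmaps along the channel axis. The resolution, channel counts, and variable names are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch: assembling a multi-channel scene tensor from the three cues
# named in the abstract (depth, 3D pose, historical trajectory). All sizes are assumed.
import torch

H, W = 128, 224          # assumed spatial resolution of the scene maps
T_obs = 8                # assumed number of observed (historical) frames
N_joints = 17            # assumed number of 3D pose keypoints per person

depth_map = torch.rand(1, H, W)            # per-pixel depth of the scene (1 channel)
pose_maps = torch.rand(N_joints, H, W)     # heatmaps rendered from the person's 3D pose
traj_maps = torch.rand(T_obs, H, W)        # historical location heatmaps, one per frame

# Concatenate all cues along the channel axis into one tensor the network consumes.
scene_tensor = torch.cat([depth_map, pose_maps, traj_maps], dim=0)
print(scene_tensor.shape)  # torch.Size([26, 128, 224]) -> 1 + 17 + 8 channels
```

This tensor would then be fed to the fully convolutional transformer backbone, which predicts the pedestrian's future image location and depth.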