Abstract: Predicting the future is a crucial ability for intelligence systems. It is of great importance for many real-world applications, such as autonomous driving, which need scene parsing to understand the environment. Recent research has shown that predicting in semantic level is more effective than segmenting the predicted RGB frames. In order to label pixels in future frames correctly, the rich contextual dependencies should be exploit which existing methods paid less attention to. Therefore, we propose a novel network which catches both the short-term and long-term relations of observed frames for future scene parsing. Specifically, we introduce an attention mechanism to model semantic interdependencies between consecutive frames and a modified convolutional LSTM to model the correlations among all the input frames. Experiments validate that our approach outperforms other state-of-the-art methods on the large-scale Cityscapes dataset.
0 Replies
Loading