Abstract: In an attempt to replicate the success of transformers in natural language processing on computer vision tasks, vision transformers (ViTs) have recently gained attention. Performance breakthroughs have been achieved on coarse-grained tasks such as classification. However, dense prediction tasks, such as detection, segmentation, and depth estimation, require additional modifications and have so far been tackled only in an ad hoc manner, by replacing the convolutional-neural-network encoder backbone of an existing architecture with a ViT. This study proposes a fully attentional transformer that can perform both coarse and dense prediction tasks. To the best of our knowledge, the proposed architecture is the first to be composed of attention layers throughout, including the decoder part of the network. This is possible because our newly proposed local-global attention (LGA) can flexibly perform both downsampling and upsampling of spatial features, the key operations required for dense prediction. Compared with existing ViTs on classification tasks, our architecture offers a reasonable trade-off between performance and efficiency. On the depth estimation task, it achieves performance comparable to that of state-of-the-art transformer-based methods.
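The abstract does not specify how LGA resamples spatial features, so the following is only a minimal sketch, assuming PyTorch, of the general idea that attention can change spatial resolution: queries are placed on a target-resolution grid while keys and values keep the source resolution. The module name `AttentionResample` and the query-interpolation scheme are illustrative assumptions, not the paper's LGA.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionResample(nn.Module):
    """Hypothetical sketch: cross-attention that changes spatial resolution.

    Queries live on a target grid (built by resizing the input features),
    while keys/values keep the source resolution, so one module can either
    downsample or upsample. Illustration only; not the paper's LGA.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, out_hw: tuple) -> torch.Tensor:
        # x: (B, C, H, W) feature map; out_hw: target (H', W').
        b, c, h, w = x.shape
        # Build queries on the target grid by resizing the input features.
        q = F.interpolate(x, size=out_hw, mode="bilinear", align_corners=False)
        q = q.flatten(2).transpose(1, 2)    # (B, H'*W', C)
        kv = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        out, _ = self.attn(self.norm(q), self.norm(kv), self.norm(kv))
        out = out + q                       # residual on the query path
        return out.transpose(1, 2).reshape(b, c, *out_hw)
```

Under this sketch, downsampling and upsampling differ only in the choice of `out_hw` (for example, `AttentionResample(dim=64)(x, (h // 2, w // 2))` halves the resolution, while `(2 * h, 2 * w)` doubles it), which mirrors the abstract's claim that a single attention operation can serve both roles in an encoder-decoder design.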