Abstract: The core of Multi-view Stereo (MVS) is the matching process between reference and source pixels.
Cost aggregation plays a significant role in this process, and previous methods handle it
via CNNs. These methods may inherit the natural limitation
of CNNs, which fail to discriminate repetitive or incorrect matches because of their limited local receptive fields.
To handle this issue, we introduce the Transformer
into cost aggregation. However, another problem
may arise from the quadratically growing computational complexity of the Transformer, which results in memory overflow and inference latency.
In this paper, we overcome these limits with an
efficient Transformer-based cost aggregation network, namely CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is proposed to
aggregate long-range features on the cost volume via
self-attention mechanisms along the depth and spatial dimensions. Furthermore, the Residual Regression
Transformer (RRT) is proposed to enhance spatial
attention. The proposed method is a universal plug-in to improve learning-based MVS methods.
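
To make the complexity argument concrete, the following is a minimal sketch and not the authors' RDACT or RRT. It assumes a cost volume of shape D x H x W (depth hypotheses by image height by width) and compares the number of attention-score entries when every voxel attends to every other voxel against the number when each pixel attends only along its own depth axis. The function names, the scalar per-voxel feature, and the absence of learned query/key/value projections are all illustrative assumptions.

```python
import math


def attention_pairs_full(d, h, w):
    """Score entries when all D*H*W voxels attend to each other (quadratic in D*H*W)."""
    n = d * h * w
    return n * n


def attention_pairs_depth_only(d, h, w):
    """Score entries when each of the H*W pixels attends only across its D depths."""
    return h * w * d * d


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def depth_self_attention(costs):
    """Toy single-head self-attention over one pixel's depth hypotheses.

    `costs` is a list of D matching costs for a single pixel. Assumption:
    query, key and value are the raw scalar cost itself (no learned
    projections, no scaling), so this only illustrates the data flow.
    """
    out = []
    for q in costs:
        weights = softmax([q * k for k in costs])
        out.append(sum(wt * v for wt, v in zip(weights, costs)))
    return out


if __name__ == "__main__":
    d, h, w = 32, 64, 80  # hypothetical cost-volume size
    full = attention_pairs_full(d, h, w)
    depth = attention_pairs_depth_only(d, h, w)
    print(full, depth, full // depth)
    print(depth_self_attention([0.2, 1.5, 0.1, 0.3]))
```

For the hypothetical 32 x 64 x 80 volume, the full attention map has H*W = 5120 times more entries than the depth-only one, which is the kind of growth the abstract identifies as the source of memory overflow and latency.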