Abstract—The core of Multi-View Stereo (MVS) is to find corresponding pixels in neighboring images. However, in challenging regions of input images such as untextured areas, repetitive patterns, or reflective surfaces, existing methods struggle to find precise pixel correspondences, resulting in inferior reconstruction quality. In this paper, we present an efficient context-perception MVS network, termed ACP-MVS. ACP-MVS constructs a context-aware cost volume that enhances pixels carrying essential context information while suppressing irrelevant or noisy information via our proposed Context-stimulated Weighting Fusion module. Furthermore, we introduce a new Context-Guided Global Aggregation module, based on the insight that similar-looking pixels tend to have similar depths, which exploits global contextual cues to implicitly guide depth detail propagation from high-confidence regions to low-confidence ones. These two modules work in synergy to substantially improve the reconstruction quality of ACP-MVS without incurring significant additional computational or time cost. Extensive experiments demonstrate that our approach not only achieves state-of-the-art performance but also offers the fastest inference speed and minimal GPU memory usage, providing practical value for practitioners working with high-resolution MVS image sets. Notably, our method ranks 2nd on the challenging Tanks and Temples advanced benchmark among all published methods. Code is available at https://github.com/HaoJia-mongh/ACP-MVS.