Abstract: We introduce a lightweight and accurate architecture for resource-efficient visual correspondence. Our method, dubbed XFeat (Accelerated Features), revisits fundamental design choices in convolutional neural networks for detecting, extracting, and matching local features. Our new model addresses a critical need for fast and robust algorithms suitable for resource-limited devices. In particular, accurate image matching requires sufficiently large image resolutions. For this reason, we keep the resolution as large as possible while limiting the number of channels in the network. In addition, our model offers the choice of matching at the sparse or semi-dense level, each of which may suit different downstream applications, such as visual navigation and augmented reality. Our model is the first to offer semi-dense matching efficiently, leveraging a novel match refinement module that relies on coarse local descriptors. XFeat is versatile and hardware-independent. It surpasses current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy, as demonstrated in pose estimation and visual localization. We showcase it running in real time on an inexpensive laptop CPU without specialized hardware optimizations. Code and weights are available at verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24.