EVP: Enhanced Visual Perception Using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment

Published: 12 Dec 2023, Last Modified: 12 Nov 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: This work presents the network architecture EVP (Enhanced Visual Perception). EVP builds on the previous work VPD, which paved the way for using the Stable Diffusion network in computer vision tasks. We propose two major enhancements to align both the image (spatial) features and the text features. First, we develop the Inverse Multi-Attentive Feature Refinement (IMAFR) module, which enhances spatial feature learning by aggregating information from higher pyramid levels. Second, we propose a novel image-text alignment module for improved feature extraction from the Stable Diffusion backbone. The resulting architecture is suitable for a wide variety of tasks, and we demonstrate its performance in the context of single-image depth estimation and referring segmentation. Comprehensive experiments on established datasets show that EVP achieves state-of-the-art results in single-image depth estimation for indoor (NYU Depth v2, RMSE improvement over VPD) and outdoor (KITTI, RMSE improvement over GEDepth) environments, as well as in referring segmentation (RefCOCO, 2.53 IoU improvement over ReLA). On the Booster dataset, we outperform our baseline VPD by more than 45% in both REL and RMSE metrics. The code and pre-trained models are publicly available at https://github.com/Lavreniuk/EVP.
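The abstract describes IMAFR only at a high level: each spatial feature map is refined by aggregating information from higher (coarser) pyramid levels. As a rough illustration of that general idea only, and not the authors' implementation (which is available in the linked repository), the following PyTorch sketch fuses each pyramid level with attention-weighted, upsampled features from all coarser levels. All names here (PyramidAggregationSketch, attn, fuse, feats) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidAggregationSketch(nn.Module):
    """Hypothetical sketch: refine each pyramid level with information
    aggregated from coarser (higher) pyramid levels via per-pixel
    attention weights. Not the authors' IMAFR module."""

    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        # One 1x1 projection per candidate level, producing attention logits.
        self.attn = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: list of tensors [B, C, H_i, W_i], ordered fine -> coarse.
        refined = []
        for i, f in enumerate(feats):
            # Upsample every coarser level to the current resolution.
            coarser = [
                F.interpolate(g, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
                for g in feats[i + 1:]
            ]
            candidates = [f] + coarser
            # Per-pixel softmax attention over the candidate levels.
            logits = torch.stack(
                [self.attn[j](c) for j, c in enumerate(candidates)], dim=0
            )  # [L, B, 1, H, W]
            weights = torch.softmax(logits, dim=0)
            fused = sum(w * c for w, c in zip(weights, candidates))
            refined.append(self.fuse(fused))
        return refined


# Example: four pyramid levels with 64 channels, fine -> coarse resolutions.
feats = [torch.randn(1, 64, s, s) for s in (64, 32, 16, 8)]
refined = PyramidAggregationSketch(channels=64, num_levels=4)(feats)
```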