Denoised and Dynamic Alignment Enhancement for Zero-Shot Learning

Published: 01 Jan 2025, Last Modified: 14 Apr 2025 · IEEE Trans. Image Process. 2025 · CC BY-SA 4.0
Abstract: Zero-shot learning (ZSL) focuses on recognizing unseen categories by aligning visual features with semantic information. Recent advances have shown that aligning each attribute with its corresponding visual region significantly improves zero-shot learning performance. However, the crude semantic proxies used in these methods fail to capture the varied appearances of each attribute and are easily confused by semantically redundant backgrounds, leading to suboptimal alignment. To address these issues, we introduce a novel Alignment-Enhanced Network (AENet), designed to denoise visual features and dynamically perceive semantic information, thereby enhancing visual-semantic alignment. Our approach comprises two key innovations: (1) a visual denoising encoder, which employs a class-agnostic mask to filter out semantically redundant visual information, producing refined visual features that generalize to unseen classes; and (2) a dynamic semantic generator, which adaptively crafts content-aware semantic proxies steered by visual features, enabling AENet to discriminate fine-grained variations in visual content. Additionally, we integrate a cross-fusion module to ensure comprehensive interaction between the denoised visual features and the generated dynamic semantic proxies, further facilitating visual-semantic alignment. Extensive experiments on three datasets demonstrate that the proposed method narrows the visual-semantic gap and sets a new state of the art in this setting.
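The abstract names three components: a mask-based visual denoising encoder, a visually conditioned dynamic semantic generator, and a cross-fusion module. The paper's exact architecture is not given here, so the following is only a minimal PyTorch sketch of how such components could fit together; all module names, layer choices, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DenoisingEncoder(nn.Module):
    """Hypothetical sketch: predict a class-agnostic soft mask over region
    features to suppress semantically redundant (background) regions."""
    def __init__(self, dim):
        super().__init__()
        self.mask_head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feats):              # feats: (B, R, D) region features
        mask = self.mask_head(feats)       # (B, R, 1) soft foreground mask
        return feats * mask                # masked (denoised) features

class DynamicSemanticGenerator(nn.Module):
    """Hypothetical sketch: modulate static attribute embeddings with a
    global visual context vector to obtain content-aware semantic proxies."""
    def __init__(self, dim, n_attr):
        super().__init__()
        self.static_proxies = nn.Parameter(torch.randn(n_attr, dim))
        self.modulator = nn.Linear(dim, dim)

    def forward(self, feats):              # feats: (B, R, D)
        ctx = feats.mean(dim=1)            # (B, D) global visual context
        shift = self.modulator(ctx)        # per-image modulation
        # (B, A, D): one proxy set per image, shifted by visual content
        return self.static_proxies.unsqueeze(0) + shift.unsqueeze(1)

class CrossFusion(nn.Module):
    """Hypothetical sketch: cross-attention with semantic proxies as
    queries and denoised region features as keys/values."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, proxies, feats):
        fused, _ = self.attn(proxies, feats, feats)
        return fused                       # (B, A, D) attribute-aligned features

# Toy forward pass: batch of 2 images, 49 regions, 64-d features, 85 attributes.
B, R, D, A = 2, 49, 64, 85
feats = torch.randn(B, R, D)
denoised = DenoisingEncoder(D)(feats)
proxies = DynamicSemanticGenerator(D, A)(denoised)
fused = CrossFusion(D)(proxies, denoised)
```

The attribute-aligned outputs would then typically be scored against class attribute vectors for zero-shot classification; that scoring step is omitted here.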