Language-Based Colorization with Sparse Attention and Multi-scale Cross-Modal Semantic Alignment

Ying Zhang, Yutong Gao, Xuan Liu, Lefeng Zhang, Xianggan Liu, Shan Jiang

Published: 01 Jan 2025, Last Modified: 12 Nov 2025, License: CC BY-SA 4.0

Abstract: Language-based colorization generates realistic and aesthetically appealing colors under the guidance of intuitive, user-friendly natural language descriptions. Previous language-based colorization methods face several significant challenges, including limited color richness, color bleeding, color distortions, and inconsistent color styles. These issues often arise from reliance on dense attention mechanisms, which can overemphasize global features at the expense of local details. Furthermore, many feature fusion modules for text and grayscale images are not designed to adequately align the diverse feature representations, resulting in suboptimal colorization outcomes. In this paper, we introduce a novel module that employs sparse attention to mitigate color bleeding and limited colorfulness in image colorization tasks. The sparse attention mechanism enables more flexible allocation of computational resources with a focus on color awareness. Additionally, we propose a module that aligns grayscale images with color descriptions, thereby improving the consistency and quality of the colorization results. This module leverages the synergy between image features and language descriptions to address long-standing issues such as color bleeding. Empirical evaluations demonstrate that our approach surpasses recent state-of-the-art techniques in both automatic and language-based colorization tasks, validating its effectiveness and robustness in generating high-quality, visually appealing colorized images.
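
To make the contrast between dense and sparse attention concrete, the following is a minimal sketch of one common sparsification scheme, top-k attention, in which each query attends only to its k highest-scoring keys. This is an illustration under stated assumptions, not the authors' module: the abstract does not specify the sparsity pattern used, and the function names (`softmax`, `topk_sparse_attention`) and the parameter `k` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_attention(queries, keys, values, k):
    """Scaled dot-product attention restricted to the top-k keys per query.

    Assumption: top-k masking stands in for "sparse attention" here;
    the paper's actual mechanism may differ.
    Returns (outputs, weights), one row per query.
    """
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
            for key in keys
        ]
        # Indices of the k largest scores; all other keys are masked out.
        keep = set(
            sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        )
        masked = [s if i in keep else float("-inf") for i, s in enumerate(scores)]
        weights = softmax(masked)
        # Weighted sum of value vectors using the sparse weights.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

With `k` equal to the number of keys, the function reduces to ordinary dense attention; with a smaller `k`, low-scoring keys contribute exactly zero weight, which is the sense in which a query's output can no longer be diluted by weakly related positions.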

External IDs: doi:10.1007/978-981-96-1548-3_14