DiffTV: Identity-Preserved Thermal-to-Visible Face Translation via Feature Alignment and Dual-Stage Conditions

Published: 20 Jul 2024, Last Modified: 31 Jul 2024, MM 2024 Poster, CC BY 4.0
Abstract: The thermal-to-visible (T2V) face translation task is essential for enabling face verification in low-light or dark conditions by converting thermal infrared faces into their visible counterparts. However, this task faces two primary challenges. First, the inherent differences between the two modalities hinder the effective use of thermal information to guide RGB face reconstruction. Second, translated RGB faces often lack identity details present in the corresponding visible faces, such as skin color. To tackle these challenges, we introduce DiffTV, the first Latent Diffusion Model (LDM) specifically designed for identity-preserving T2V facial image translation. DiffTV employs a novel heterogeneous feature alignment strategy that bridges the modal gap and extracts both coarse- and fine-grained identity features consistent with the visible images. Furthermore, a dual-stage condition injection strategy introduces control information to guide identity-preserved translation. Experimental results demonstrate the superior performance of DiffTV, particularly in scenarios where maintaining identity integrity is critical.
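The paper's implementation is not shown on this page, but the dual-stage condition injection idea can be pictured with a minimal PyTorch sketch. Everything below is an assumption drawn only from the abstract: coarse identity features modulate the denoiser through AdaGN-style scale/shift, while fine-grained features enter through cross-attention. ThermalEncoder, ConditionedBlock, and all layer sizes are hypothetical placeholders, not the authors' architecture.

```python
# Hypothetical sketch of dual-stage condition injection for one denoiser
# block of a T2V latent diffusion model. Module names and shapes are
# illustrative assumptions, not DiffTV's actual implementation.
import torch
import torch.nn as nn

class ThermalEncoder(nn.Module):
    """Maps a thermal face latent to coarse and fine identity features."""
    def __init__(self, in_ch=4, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.SiLU(),
        )
        self.coarse_head = nn.AdaptiveAvgPool2d(1)  # global identity vector

    def forward(self, z_thermal):
        f = self.backbone(z_thermal)
        coarse = self.coarse_head(f).flatten(1)      # (B, dim) coarse identity
        fine = f.flatten(2).transpose(1, 2)          # (B, H*W, dim) fine tokens
        return coarse, fine

class ConditionedBlock(nn.Module):
    """One denoiser block with two injection stages: the coarse feature
    modulates activations via scale/shift (stage 1), the fine tokens are
    attended to via cross-attention (stage 2)."""
    def __init__(self, dim=256):
        super().__init__()
        self.norm = nn.GroupNorm(8, dim)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, h, coarse, fine):
        B, C, H, W = h.shape
        # Stage 1: coarse identity vector -> per-channel scale and shift.
        scale, shift = self.to_scale_shift(coarse).chunk(2, dim=1)
        h = self.norm(h) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        # Stage 2: spatial features query the fine-grained identity tokens.
        q = h.flatten(2).transpose(1, 2)             # (B, H*W, C)
        attn_out, _ = self.attn(q, fine, fine)
        h = h + attn_out.transpose(1, 2).reshape(B, C, H, W)
        return self.conv(h)

if __name__ == "__main__":
    # Smoke test with random tensors; all shapes are illustrative.
    enc, block = ThermalEncoder(), ConditionedBlock()
    coarse, fine = enc(torch.randn(2, 4, 16, 16))    # thermal latent
    h = block(torch.randn(2, 256, 16, 16), coarse, fine)
    print(h.shape)  # torch.Size([2, 256, 16, 16])
```

The split mirrors the abstract's description: a global (coarse) condition steers overall identity attributes such as skin color, while token-level (fine) conditions supply spatially localized identity detail during denoising.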
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Social Aspects of Generative AI
Relevance To Conference: Our research contributes directly to multimedia and multimodal processing through its approach to thermal-to-visible (T2V) translation, which is crucial for face recognition across modalities. Addressing the inherent limitations of thermal imaging, such as the lack of race information and the loss of identity details, DiffTV pioneers the use of a Latent Diffusion Model (LDM) to preserve identity in low-light conditions. This is particularly relevant to the multimedia community because the method fuses thermal and visible data, injecting control information at two conditioning stages to refine the translation process. The successful application of our method demonstrates clear advances in multimodal understanding and processing, contributing to the community's ongoing work in cross-modality research.
Supplementary Material: zip
Submission Number: 1337