Abstract: Due to the limitations of infrared image acquisition conditions, many essential tasks currently rely on visible images as the main source of training data. However, single-modal data makes it difficult for downstream networks to achieve optimal performance. Converting the more easily obtainable visible images into infrared images is therefore an effective remedy for the critical shortage of infrared data. Yet current methods typically focus solely on transferring visible images to an infrared style, overlooking crucial infrared thermal features during cross-modal translation. To improve the authenticity of cross-modal translation at the feature level, this paper introduces MappingFormer, a translation network based on frequency feature mapping and dual-patch contrast, which achieves cross-modal image generation from visible to infrared. Specifically, the generator incorporates two branches, low-frequency feature mapping (LFM) and high-frequency feature refinement (HFR), both of which embed Swin Transformer blocks. The LFM branch captures fuzzy structural features from visible images, while the HFR branch focuses on mapping edge and texture features. The extracted dual-branch frequency features are refined and fused through cross-attention. Additionally, a dual feature-patch contrastive learning mechanism (DFPC) is designed to infer effective mappings between unaligned cross-modal data. Extensive experimental results demonstrate the effectiveness of this method in cross-modal feature mapping and visible-to-infrared image generation. The method holds significant potential for downstream tasks where infrared data is limited.
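To make the dual-branch design concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes a Gaussian-blur low-pass split into low/high-frequency inputs, uses plain convolutional blocks in place of the Swin Transformer blocks, and fuses the two branches with standard multi-head cross-attention. All module and parameter names (frequency_split, BranchEncoder, CrossAttentionFusion, DualBranchGenerator) are hypothetical.

```python
# Illustrative sketch of a dual-branch frequency generator with cross-attention
# fusion. Layer choices are assumptions; conv blocks stand in for Swin blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


def frequency_split(x: torch.Tensor, kernel_size: int = 9, sigma: float = 2.0):
    """Split an image batch into low- and high-frequency parts via Gaussian blur."""
    coords = torch.arange(kernel_size, dtype=x.dtype, device=x.device) - kernel_size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = torch.outer(g, g).view(1, 1, kernel_size, kernel_size)
    kernel = kernel.repeat(x.size(1), 1, 1, 1)  # one depthwise kernel per channel
    low = F.conv2d(x, kernel, padding=kernel_size // 2, groups=x.size(1))
    return low, x - low  # low-frequency structure, high-frequency edges/texture


class BranchEncoder(nn.Module):
    """Stand-in encoder for one frequency branch (conv blocks instead of Swin blocks)."""
    def __init__(self, in_ch: int = 3, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class CrossAttentionFusion(nn.Module):
    """Fuse the two branch features with multi-head cross-attention."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, low_feat, high_feat):
        b, c, h, w = low_feat.shape
        q = low_feat.flatten(2).transpose(1, 2)    # queries from the LFM branch
        kv = high_feat.flatten(2).transpose(1, 2)  # keys/values from the HFR branch
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)               # residual + norm
        return fused.transpose(1, 2).view(b, c, h, w)


class DualBranchGenerator(nn.Module):
    """Visible-to-infrared generator: frequency split -> two branches -> fusion -> decode."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.lfm = BranchEncoder(dim=dim)  # low-frequency feature mapping branch
        self.hfr = BranchEncoder(dim=dim)  # high-frequency feature refinement branch
        self.fuse = CrossAttentionFusion(dim=dim)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, 1, 3, padding=1), nn.Tanh(),  # single-channel infrared output
        )

    def forward(self, visible):
        low, high = frequency_split(visible)
        fused = self.fuse(self.lfm(low), self.hfr(high))
        return self.decoder(fused)


if __name__ == "__main__":
    fake_ir = DualBranchGenerator()(torch.randn(2, 3, 128, 128))
    print(fake_ir.shape)  # torch.Size([2, 1, 128, 128]) at this toy resolution
```

In this sketch the low-frequency features act as queries and the high-frequency features as keys/values, so structural content is refined by edge and texture cues; the actual network additionally applies the DFPC patch-based contrastive objective during training, which is not shown here.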
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work belongs to the field of generative multimedia, primarily focusing on feature learning and image generation between visible and infrared images. We design a novel cross-modal image generation method to alleviate the scarcity of infrared image data for certain scenes or tasks. More precisely, we formulate an efficient generative adversarial network that tackles the authenticity and effectiveness challenges in cross-modal image translation through feature mapping and contrastive learning. This work can be applied to converting visible images or videos to the infrared modality, thereby overcoming the obstacle posed by unavailable or scarce infrared-domain data. Consequently, this work belongs to multimodal/cross-modal/multi-domain research, and we sincerely hope this paper is suitable for “ACM MULTIMEDIA 2024”.
Submission Number: 3637