SCLSTE: Semi-supervised Contrastive Learning-Guided Scene Text Editing

Min Yin, Liang Xie, Haoran Liang, Xing Zhao, Ben Chen, Ronghua Liang

Published: 01 Jan 2025, Last Modified: 17 Apr 2025, MMM (3) 2025, CC BY-SA 4.0

Abstract: The objective of scene text editing is to replace the existing text in an image with the desired text while keeping the background and text style intact. However, existing methods struggle to replicate text styles faithfully because of the complexity of real scenes, and most are limited to training exclusively on labeled synthetic datasets. One alternative, a semi-supervised method, incorporates unlabeled real-scene text images into training by using the original image as supervision and editing it with the same text. However, this approach risks degrading the model into an identity mapping network. To address these problems, we introduce a novel semi-supervised training strategy that incorporates contrastive learning. It allows real-scene text images to be edited with arbitrary text, circumventing the identity mapping issue while ensuring the accuracy of both text content and style. Moreover, we propose a robust Style-Aware Text Editing Module to handle the complexity of real scenes and improve the imitation of text styles. To the best of our knowledge, our work is the first to apply contrastive learning to scene text editing. Extensive experiments demonstrate that our method outperforms existing models both qualitatively and quantitatively. In particular, on the HEval metrics for both the real-scene dataset (Tamper-Scene) and the synthetic dataset (Tamper-Syn2k), we achieve a 12% improvement over the state-of-the-art method.
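
The abstract does not specify the form of the contrastive objective. As a rough illustration of the general idea only, the sketch below implements a generic InfoNCE-style loss, a common contrastive formulation, and not necessarily the one used in SCLSTE. The embedding vectors, the function names `cosine` and `info_nce`, and the temperature value are all assumptions made for this example. The loss is low when the anchor embedding is closer to its positive (for instance, a style embedding of the edited image versus that of the source image) than to any negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (hypothetical helper).

    Returns -log( exp(s_pos / t) / sum_k exp(s_k / t) ), where the sum
    runs over the positive and all negatives.
    """
    # Logit 0 belongs to the positive pair; the rest are negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]

    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy 2-D embeddings, purely illustrative.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy case `aligned` is close to zero and `misaligned` is large, which is the behavior a contrastive objective relies on: pulling matching pairs together and pushing mismatched pairs apart.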