Abstract: Most Scene Text Removal (STR) frameworks rely on predicting a text mask. During training, the predicted mask is supervised by either a text stroke mask label or a text box mask label. We identify two problems with the text stroke mask label: the labels are noisy, and the 0–1 (hard) mask representation is inappropriate for text strokes. We propose addressing both with a smooth text stroke mask. Specifically, we generate a large amount of synthetic text segmentation data (text images paired with smooth text stroke mask labels) using a text image synthesis engine, and we train the text stroke segmentation sub-network of our framework only on this synthetic data. We also find that the text region inpainting networks of most STR frameworks lack a sufficient receptive field, which limits their ability to perceive global structure. To address this, we devise a two-stage coarse-to-refinement text region inpainting sub-network that consists of a coarse-inpainting stage with a global receptive field and a refinement stage with a local receptive field. Experiments on the benchmark datasets demonstrate that our framework outperforms existing state-of-the-art methods on all Image-Eval and Detection-Eval metrics.
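To make the distinction between a hard (0–1) mask and a smooth mask concrete, the following is a minimal sketch only. The abstract does not say how the synthesis engine produces smooth labels, so the Gaussian softening used here is an assumption chosen purely for illustration, and the names `gaussian_kernel_1d`, `smooth_mask`, and `sigma` are hypothetical rather than part of the described framework.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float, radius: int) -> np.ndarray:
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def smooth_mask(hard_mask: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Soften a 0-1 stroke mask into values in [0, 1].

    Assumption: Gaussian blurring stands in for whatever the synthesis
    engine actually does to produce smooth stroke labels.
    """
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel_1d(sigma, radius)
    # Replicate border pixels so the output keeps the input shape.
    padded = np.pad(hard_mask.astype(np.float64), radius, mode="edge")
    # Separable blur: filter along rows first, then along columns.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    out = np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )
    return np.clip(out, 0.0, 1.0)


# Toy example: a vertical stroke three pixels wide in a 9x9 image.
hard = np.zeros((9, 9), dtype=np.uint8)
hard[:, 3:6] = 1
soft = smooth_mask(hard, sigma=1.0)
```

In this toy case the hard mask jumps from 0 to 1 at the stroke boundary, while the smooth mask falls off gradually from the stroke center, so boundary pixels carry fractional values instead of a forced binary decision.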