Adaptive Scaling and Refined Pyramid Feature Fusion Network for Scene Text Segmentation

Published: 01 Jan 2024 · Last Modified: 11 Apr 2025 · ICDAR (5) 2024 · CC BY-SA 4.0
Abstract: Although scene text recognition has achieved high performance, text segmentation still needs improvement. The goal of text segmentation is to obtain pixel-level foreground text masks from scene images. In this paper, we adaptively resize input images to their optimal scales and propose the Refined Pyramid Feature Fusion Network (RPFF-Net) for robust scene text segmentation. To address the issue of inconsistent text scales, we propose an adaptive image scaling method that takes into account the density of text regions in each scene image. In the RPFF-Net, we first extract multi-scale features from the backbone network and then combine them with effective pyramid feature fusion methods. To enhance the interaction among contextual characters and to exploit features at different levels, we apply two self-attention mechanisms to the fused feature map, along the spatial and channel dimensions respectively. Experimental results on several text segmentation benchmarks, including the monolingual TextSeg and bilingual BTS datasets, demonstrate the effectiveness of our approach and show that it outperforms existing state-of-the-art scene text segmentation methods even without OCR (optical character recognition) enhancement.
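The abstract describes applying self-attention to the fused feature map along both the spatial and channel dimensions. The paper's exact formulation is not given here, so the following is only a minimal NumPy sketch of the general idea: a spatial attention map relating all positions to each other, and a channel attention map relating all channels to each other, each applied to a fused feature tensor. All function names and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax used to normalize attention scores.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(feat):
    # feat: (C, H, W). Attention over the H*W spatial positions,
    # so each position is re-weighted by its similarity to all others.
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)          # (C, N) with N = H*W
    attn = softmax(x.T @ x, axis=-1)    # (N, N) position-to-position affinity
    out = x @ attn.T                    # mix positions according to affinity
    return out.reshape(C, H, W)

def channel_self_attention(feat):
    # feat: (C, H, W). Attention over the channel dimension,
    # modeling dependencies between feature channels.
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)          # (C, N)
    attn = softmax(x @ x.T, axis=-1)    # (C, C) channel-to-channel affinity
    out = attn @ x                      # mix channels according to affinity
    return out.reshape(C, H, W)

# Toy fused feature map standing in for the pyramid fusion output.
fused = np.random.rand(8, 4, 4).astype(np.float32)
refined = channel_self_attention(spatial_self_attention(fused))
print(refined.shape)  # (8, 4, 4)
```

In practice such attention modules are usually implemented with learned query/key/value projections (e.g. 1x1 convolutions) and a residual connection back to the input; the sketch above omits those learned parameters to keep the two attention axes easy to see.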