Abstract: A growing number of end-to-end text spotting methods based on the Transformer architecture have demonstrated superior performance. These methods use a bipartite graph matching algorithm to perform one-to-one optimal matching between predicted objects and ground-truth objects. However, the instability of bipartite graph matching can lead to inconsistent optimization targets, which degrades the training performance of the model. Existing work applies denoising training to address this instability in object detection tasks. Unfortunately, that denoising training method cannot be applied directly to text spotting, which requires detecting irregular shapes and performing text recognition, a task more complex than classification. To address this issue, we propose a novel denoising training method (DNTextSpotter) for arbitrary-shaped text detection and recognition. Specifically, we decompose the queries of the denoising part into noised positional queries and noised content queries. We use the four Bezier control points of the Bezier center curve to generate the noised positional queries. For the noised content queries, because outputting text in a fixed positional order is not conducive to aligning position with content, we employ a masked character sliding method to initialize the content queries and thereby assist the alignment of text content and position. To improve the model's perception of the background, we further add a loss function for background character classification in the denoising training part. Although DNTextSpotter is conceptually simple, it outperforms state-of-the-art methods on four benchmarks (Total-Text, SCUT-CTW1500, ICDAR15, and Inverse-Text), with an improvement of $11.3\%$ over the best previous approach on Inverse-Text.
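The idea of building noised positional queries from the four control points of a Bezier center curve can be illustrated with a minimal sketch. Only the cubic Bezier formula is standard here. The uniform noise scheme, the `scale` value, the normalized coordinates, and the helper names `cubic_bezier`, `add_noise`, and `sample_curve` are assumptions made for illustration, not the method or code described in the submission.

```python
import random


def cubic_bezier(ctrl, t):
    """Evaluate a cubic Bezier curve with four (x, y) control points at t in [0, 1]."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = ctrl
    u = 1.0 - t
    # Bernstein basis weights for a cubic curve.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3
    y = b0 * y0 + b1 * y1 + b2 * y2 + b3 * y3
    return (x, y)


def add_noise(ctrl, scale, rng):
    """Shift every coordinate by uniform noise in [-scale, scale].

    Assumption: the submission's actual noise distribution is not stated
    in the abstract; uniform jitter is used here only as a placeholder.
    """
    return [
        (x + rng.uniform(-scale, scale), y + rng.uniform(-scale, scale))
        for x, y in ctrl
    ]


def sample_curve(ctrl, n):
    """Sample n evenly spaced points (in parameter t) along the curve."""
    return [cubic_bezier(ctrl, i / (n - 1)) for i in range(n)]


# Hypothetical ground-truth center curve in normalized image coordinates.
center_curve = [(0.10, 0.50), (0.30, 0.40), (0.60, 0.40), (0.90, 0.50)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
noised_ctrl = add_noise(center_curve, scale=0.02, rng=rng)
noised_points = sample_curve(noised_ctrl, n=5)
```

In a denoising setup of this kind, the model would be trained to recover `center_curve` from queries derived from `noised_ctrl`; because each noised query is tied to a known ground-truth instance, that part of training needs no bipartite matching.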
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Systems] Systems and Middleware
Relevance To Conference: By enhancing the denoising training for improved recognition of irregular scene text, this work contributes significantly to multimedia and multimodal processing. Text spotting, as a component of multimedia data processing, plays a crucial role in converting characters from images into editable and searchable text formats. It bridges the gap between visual modalities (images) and linguistic modalities (text), enabling machines to understand the textual content within images, thus facilitating cross-modal understanding, association, and reasoning.
Our main contributions can be summarized as follows:
1. We introduce a novel denoising training method to design an end-to-end text spotting architecture.
2. Directly using ground-truth text scripts to initialize noised queries causes misalignment between the positions of characters and their content. To counter this, we design a masked character sliding method that preprocesses the ground-truth text scripts, improving the alignment between text position and content.
Overall, our work provides an efficient and accurate method for processing and understanding text information in multimedia content, which is crucial for multimedia content analysis and processing.
Supplementary Material: zip
Submission Number: 2058