PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer

Ruijin Liu; Ning Lu; Dapeng Chen; Cheng LI; Zejian Yuan; Wei Peng

Published: 01 Feb 2023, Last Modified: 13 Feb 2023, Submitted to ICLR 2023, Readers: Everyone

Keywords: Complex Shape Text Detection, Text Representation, Transformer, Computer Vision, Application

TL;DR: This paper presents PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, Polynomial Band, which performs well for complex shape or crowded texts.

Abstract: We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, Polynomial Band (PB). The representation uses four polynomial curves to fit a text's top, bottom, left, and right sides, and can capture a text with a complex shape by varying the polynomial coefficients. PB has appealing features compared with conventional representations: 1) It can model different curvatures with a fixed number of parameters, while polygon-point-based methods need a different number of points for different shapes. 2) It can distinguish adjacent or overlapping texts because they have clearly different curve coefficients, while segmentation-based methods struggle when text regions adhere to each other spatially. PBFormer combines PB with the transformer and can directly generate smooth text contours sampled from the predicted curves, without interpolation. To leverage the advantages of PB, PBFormer has a parameter-free cross-scale pixel attention module. The module enlarges text features and suppresses irrelevant areas, which benefits the detection of texts with diverse scale variations. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces piecewise alignment between the ground truth and the predicted curves but also makes the curves' positions and shapes consistent with each other. Without additional text pre-training, our method is superior to the previous state-of-the-art text detectors on the arbitrary-shaped CTW1500 and Total-Text datasets. Code will be made public.
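The idea of sampling a contour from four polynomial curves can be illustrated with a minimal sketch. The abstract does not specify the exact parameterization, so everything below is an assumption made for illustration: the top and bottom sides are modeled as y = p(x), the left and right sides as x = q(y), and the function name `sample_band_contour` and its arguments are hypothetical, not taken from the authors' code.

```python
import numpy as np

def sample_band_contour(top, bottom, left, right, x_range, n=8):
    """Sample a closed contour from four polynomial curves.

    Assumed parameterization (illustrative, not from the paper):
    top and bottom are y = p(x) over x_range; left and right are
    x = q(y) between the heights where top and bottom end.
    Coefficients are ordered highest degree first (np.polyval).
    """
    x0, x1 = x_range
    xs = np.linspace(x0, x1, n)
    top_pts = np.stack([xs, np.polyval(top, xs)], axis=1)
    bot_pts = np.stack([xs, np.polyval(bottom, xs)], axis=1)

    # Right side: walk from the top-right end to the bottom-right end.
    ys_r = np.linspace(top_pts[-1, 1], bot_pts[-1, 1], n)
    right_pts = np.stack([np.polyval(right, ys_r), ys_r], axis=1)

    # Left side: walk from the bottom-left end back to the top-left end.
    ys_l = np.linspace(bot_pts[0, 1], top_pts[0, 1], n)
    left_pts = np.stack([np.polyval(left, ys_l), ys_l], axis=1)

    # Order: top (left to right), right, bottom (right to left), left.
    return np.concatenate([top_pts, right_pts, bot_pts[::-1], left_pts], axis=0)

# A gently curved band: the same 4 coefficient vectors describe the shape
# no matter how densely the contour is sampled.
top = [0.002, -0.2, 15.0]
bottom = [0.002, -0.2, 35.0]
left = [0.0]
right = [100.0]
contour = sample_band_contour(top, bottom, left, right, x_range=(0.0, 100.0), n=8)
print(contour.shape)  # (32, 2)
```

In this sketch the number of shape parameters stays fixed while `n` controls only how finely the contour is sampled, which mirrors the contrast the abstract draws with polygon-point-based representations.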

Area: Applications (e.g., speech processing, computer vision, NLP)