Textual Alchemy: CoFormer for Scene Text Understanding

Published: 01 Jan 2024, Last Modified: 18 Jun 2024 · WACV 2024 · CC BY-SA 4.0
Abstract: This paper presents CoFormer (Convolutional Fourier Transformer), a robust and adaptable transformer architecture designed for a range of scene text tasks. CoFormer integrates convolution and Fourier operations into the transformer architecture: the convolution contributes shared weights, local receptive fields, and spatial subsampling, while the Fourier operation captures composite characteristics in the frequency domain. The work further proposes two new pretraining datasets, Textverse10M-E and Textverse10M-H, and uses them to demonstrate the efficacy of pretraining for scene text understanding. CoFormer achieves state-of-the-art results, both with and without pretraining, on two downstream tasks: scene text recognition (STR) and scene text editing (STE). The paper also proposes LISTNet (Language Invariant Style Transfer), a novel framework for bilingual STE, and introduces three datasets: TST500K for STE, and CSTR2.5M and Akshara550 for STR. The source code of CoFormer is available at https://github.com/CandleLabAI/CoFormer-WACV-2024.
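The abstract describes combining frequency-domain token mixing with convolution inside a transformer-style block. As an illustration only (the actual CoFormer layer design, dimensions, and ordering are not specified in this abstract), the sketch below pairs an FNet-style real-part DFT mixing step with a shared-weight depthwise convolution, each wrapped in a residual connection. All function names and the block layout here are hypothetical.

```python
import cmath

def fourier_mixing(x):
    """FNet-style token mixing: naive DFT along the sequence axis, keep the real part.
    (A stand-in for the paper's Fourier operation; the true layer may differ.)"""
    n, dim = len(x), len(x[0])
    out = []
    for k in range(n):
        row = [0.0] * dim
        for t in range(n):
            w = cmath.exp(-2j * cmath.pi * k * t / n)
            for d in range(dim):
                row[d] += (x[t][d] * w).real
        out.append(row)
    return out

def depthwise_conv(x, kernel):
    """Shared-weight local receptive field along the sequence axis (zero-padded),
    illustrating the convolution properties the abstract mentions."""
    n, dim = len(x), len(x[0])
    pad = len(kernel) // 2
    out = []
    for t in range(n):
        row = [0.0] * dim
        for j, w in enumerate(kernel):
            s = t + j - pad
            if 0 <= s < n:
                for d in range(dim):
                    row[d] += w * x[s][d]
        out.append(row)
    return out

def coformer_like_block(x, kernel):
    """Hypothetical block: Fourier mixing, then convolution, each with a residual add."""
    mixed = fourier_mixing(x)
    x = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, mixed)]
    conv = depthwise_conv(x, kernel)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, conv)]

tokens = [[float(i + j) for j in range(4)] for i in range(6)]  # (6 tokens, 4 dims)
out = coformer_like_block(tokens, [0.25, 0.5, 0.25])
print(len(out), len(out[0]))  # 6 4
```

The sketch keeps the token grid shape unchanged, so such a block could in principle be stacked like a standard transformer layer; the real architecture should be taken from the linked repository.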