Abstract: The increasing use of smartphones for capturing documents under diverse real-world conditions has underscored the need for robust document localization technologies. Current challenges in this domain include handling diverse document types, complex backgrounds, and varying photographic conditions such as low contrast and occlusion. However, no publicly available dataset covers these complex scenarios, and few methods demonstrate their capabilities on such scenes. To address these issues, we create a new comprehensive real-world document localization benchmark dataset that contains the complex scenarios mentioned above, and we propose a novel Real-world Document Localization Network (RDLNet) for locating targeted documents in the wild. RDLNet consists of an innovative light-SAM encoder and a masked attention decoder. Through the light-SAM encoder, RDLNet transfers the strong generalization capability of SAM to the document localization task. In the decoding stage, RDLNet exploits masked attention and object queries to efficiently produce triple-branch predictions, consisting of corner point coordinates, instance-level segmentation areas, and categories of different documents, without extra post-processing. We compare RDLNet with other state-of-the-art approaches for real-world document localization on multiple benchmarks; the results show that RDLNet remarkably outperforms contemporary methods, demonstrating its superiority in both accuracy and practicability.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Experience] Interactions and Quality of Experience, [Content] Multimodal Fusion
Relevance To Conference: In this paper we propose a novel method to address real-world document localization. The method handles both images and videos captured by mobile devices. At the same time, it utilizes multimodal information, such as the position coordinate points, area masks, and edge lines of the document, to achieve precise localization. Moreover, we release an open-source dataset for document localization on images and video frames.
Supplementary Material: zip
Submission Number: 5359