Labeling Document Images for E-Commerce Products with Tree-Based Segment Re-organizing and Hierarchical Transformer
Abstract: Document images of products are widely used in e-commerce. As a special kind of data, document images have highly diverse content: text can be scattered anywhere among pictures, and both short text snippets and long text chunks occur. To predict text labels in document images, we propose a two-stage approach. The first stage, tree-based segment re-organizing, restores text order and text connections through hierarchical clustering, segment reordering, and segment merging. The second stage, a hierarchical transformer, generates segment embeddings and predicts segment labels using a segment-level encoder and a document-level encoder. We empirically study the effects of incorporating different features and compare two kinds of attention for aggregating context, in which distance and direction are measured in 1D and 2D, respectively. Experiments on a real-world dataset show that the proposed segment re-organizing method reduces the input size to the labeling model by about 40% while having a negligible impact on performance. For the hierarchical transformer, we empirically show that a document encoder using 1D attention is more effective than one using 2D attention.
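To make the 1D versus 2D distinction concrete, the following is a minimal sketch and not the paper's implementation. It assumes that 1D distance and direction are derived from segment positions in the restored reading order, and that 2D distance and direction are derived from bounding-box centers. The names `relative_1d`, `relative_2d`, and `Center` are hypothetical and chosen for illustration only.

```python
import math
from typing import Tuple

# Assumption: each segment is summarized by the (x, y) center of its bounding box.
Center = Tuple[float, float]


def relative_1d(i: int, j: int) -> Tuple[int, int]:
    """Distance and direction from segment i to segment j in reading order."""
    offset = j - i
    direction = (offset > 0) - (offset < 0)  # -1: j before i, 0: same, +1: j after i
    return abs(offset), direction


def relative_2d(a: Center, b: Center) -> Tuple[float, float]:
    """Euclidean distance and angle in radians from center a to center b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)


# Three hypothetical segments, listed in reading order.
centers = [(10.0, 10.0), (13.0, 14.0), (10.0, 50.0)]
dist_1d, dir_1d = relative_1d(0, 2)
dist_2d, angle_2d = relative_2d(centers[0], centers[1])
```

In a relative-attention setup, quantities like these would typically be bucketed and mapped to learned bias terms; how the paper encodes them is not stated in the abstract.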