Abstract: The increasing digitization of historical documents provides researchers with abundant material but also raises a new challenge: how to analyze their contents quickly and accurately. Article separation addresses this issue by avoiding the disorder and inefficiency of querying the original document material directly. However, in-depth research on this topic remains limited. Because historical documents carry information in multiple modalities, such as text and images, integrating and analyzing this multimodal information poses a further challenge.
In response to these challenges, we propose LIAS, a method based on layout information, and evaluate it on historical newspapers. Compared with existing work, LIAS brings a new perspective to research in this area. The method first identifies the separator lines of the newspaper, then analyzes the layout information to reconstruct the information flow of the document, segments that flow based on the semantic relationships between its text blocks, and thereby achieves article separation. The experiments cover diverse historical newspapers in French and Finnish with a variety of layouts. Results show that LIAS outperforms current approaches in most cases, as measured by metrics specifically designed for article separation tasks.
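To make the pipeline described above concrete, the following is a minimal illustrative sketch and not the LIAS implementation. It assumes that text blocks and vertical separator positions have already been detected, and it substitutes a simple bag-of-words cosine similarity for whatever semantic model the method actually uses. The names `Block`, `reading_order`, `separate_articles`, and the `threshold` value are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from dataclasses import dataclass
import math
import re


@dataclass
class Block:
    x: float   # left edge of the text block (assumed already detected)
    y: float   # top edge of the text block
    text: str  # OCR text of the block


def reading_order(blocks, separators_x):
    """Order blocks column by column, top to bottom within each column.

    separators_x holds the x positions of vertical separator lines
    (a simplifying assumption: only vertical separators are considered).
    """
    seps = sorted(separators_x)
    # bisect_right gives the index of the column containing the block's left edge.
    return sorted(blocks, key=lambda b: (bisect_right(seps, b.x), b.y))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def separate_articles(blocks, separators_x, threshold=0.2):
    """Cut the reading-order sequence wherever consecutive blocks are dissimilar.

    The threshold is an arbitrary illustrative value, not a tuned parameter.
    """
    articles = []
    prev_bag = None
    for block in reading_order(blocks, separators_x):
        bag = Counter(re.findall(r"\w+", block.text.lower()))
        if prev_bag is None or cosine(prev_bag, bag) < threshold:
            articles.append([])  # start a new article
        articles[-1].append(block)
        prev_bag = bag
    return articles
```

Comparing only consecutive blocks keeps the sketch short; a real system would likely need a stronger text representation and a way to handle articles that continue across non-adjacent blocks.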