Towards Enhanced Information Access in Finance: A Dataset for Table Structure Understanding in Annual Securities Reports
Abstract: Despite advances in information access and natural language processing technologies, research on retrieving non-textual information from real-world documents remains limited. Tables in particular are crucial sources of many kinds of information, which makes structuring tabular data an important problem. In this work, we focus on understanding the structure of tables in Japanese annual securities reports, a specific type of real-world document, and undertake a cell-type classification task. We constructed a new dataset by manually annotating over 111,000 cells from more than 4,000 tables. In addition to implementing a baseline program, we organized a shared task using this new dataset. The results revealed that the best-performing system completed only 75% of the tables, indicating that understanding table structures remains a challenge. Our dataset and baseline program will be made available on GitHub for researchers in this field.