SheetPT: Spreadsheet Pre-training Based on Hierarchical Attention Network

Ran Jia, Qiyu Li, Zihan Xu, Xiaoyuan Jin, Lun Du, Haoyu Dong, Xiao Lv, Shi Han, Dongmei Zhang

Published: 2023, Last Modified: 03 Oct 2023AAAI 2023Readers: Everyone

Abstract: Spreadsheets are an important and unique type of business document for data storage, analysis and presentation. The distinction between spreadsheets and most other types of digital documents lies in that spreadsheets provide users with high flexibility of data organization on the grid. Existing related techniques mainly focus on the tabular data and are incompetent in understanding the entire sheet. On the one hand, spreadsheets have no explicit separation across tabular data and other information, leaving a gap for the deployment of such techniques. On the other hand, pervasive data dependence and semantic relations across the sheet require comprehensive modeling of all the information rather than only the tables. In this paper, we propose SheetPT, the first pre-training technique on spreadsheets to enable effective representation learning under this scenario. For computational effectiveness and efficiency, we propose the coherent chunk, an intermediate semantic unit of sheet structure; and we accordingly devise a hierarchical attention-based architecture to capture contextual information across different structural granularities. Three pre-training objectives are also designed to ensure sufficient training against millions of spreadsheets. Two representative downstream tasks, formula prediction and sheet structure recognition are utilized to evaluate its capability and the prominent results reveal its superiority over existing state-of-the-art methods.

0 Replies