Abstract: Extracting structured information from all manner of webpages is an important problem with the potential to automate many real-world applications. Recent work has shown the effectiveness of leveraging DOM trees and pre-trained language models to describe and encode webpages. However, they typically optimize the model to learn the semantic co-occurrence of elements and labels in the same webpage, thus their effectiveness depends on sufficient labeled data, which is labor-intensive. In this paper, we further observe structural co-occurrences in different webpages of the same website: the same position in the DOM tree usually plays the same semantic role, and the DOM nodes in this position also share similar surface forms. Motivated by this, we propose a novel method, Structor, to effectively incorporate the structural co-occurrences over DOM tree and surface form into pre-trained language models. Such structural co-occurrences help the model learn the task better under low-resource settings, and we study two challenging experimental scenarios: website-level low-resource setting and webpage-level low-resource setting, to evaluate our approach. Extensive experiments on the public SWDE dataset show that Structor significantly outperforms the state-of-the-art models in both settings, and even achieves three times the performance of the strong baseline model in the case of extreme lack of training data.
0 Replies
Loading