Abstract: Because labeling web pages requires substantial human effort and time, web attribute extraction methods based on few-shot learning have attracted researchers' attention. However, these methods still rely heavily on sufficient labeled data from several seed websites. To alleviate the resulting lack of domain information, we design EDDVPL, a web attribute extraction model based on dual-view prompt learning, which achieves page-level few-shot learning using only a small number of labeled web pages for training. Specifically, we first retrieve semantic prompt information from the DOM tree view with a simplified algorithm to stimulate the domain-related knowledge of the pre-trained language model. Then, we introduce task prompt information from the template view by constructing a template that indicates the extraction target, which helps the pre-trained language model quickly understand the web attribute extraction task. Finally, we integrate the dual-view prompt information through template filling to jointly guide the training of the pre-trained language model at the semantic and task levels. Extensive experiments on the public SWDE dataset show that EDDVPL achieves the best results among the compared baselines.
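
As an illustration of the template-filling idea, the following minimal sketch combines a DOM-derived context (semantic view) with a task template (template view) into one prompt for a masked language model. It is not the EDDVPL implementation: the abstract does not specify the retrieval algorithm or the template wording, so the neighbouring-text heuristic, the `TEMPLATE` string, the `[MASK]` token, and the function names `semantic_prompt` and `build_prompt` are all assumptions made for this example.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect the non-empty text nodes of a page in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def semantic_prompt(html, target_text, window=1):
    """Hypothetical stand-in for DOM-tree retrieval: return the text of
    up to `window` nodes before and after the target text node."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    texts = collector.texts
    idx = texts.index(target_text)  # raises ValueError if the node is absent
    start = max(0, idx - window)
    neighbours = texts[start:idx] + texts[idx + 1:idx + 1 + window]
    return " ; ".join(neighbours)


# Hypothetical task template; the wording used in the paper is not given here.
TEMPLATE = 'Context: {context}. The text "{node}" is the {mask} of this {vertical}.'


def build_prompt(html, target_text, vertical, mask_token="[MASK]"):
    """Fill the task template with the semantic context and the target node."""
    return TEMPLATE.format(
        context=semantic_prompt(html, target_text),
        node=target_text,
        mask=mask_token,
        vertical=vertical,
    )


if __name__ == "__main__":
    page = "<div><span>Price:</span><span>$12.99</span><span>In stock</span></div>"
    print(build_prompt(page, "$12.99", "book"))
```

In a setup like this, a pre-trained masked language model would be trained to predict the attribute name (for example, "price") at the mask position, so the neighbouring text supplies domain cues while the template states the task.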