A Learning Approach to Discovering Web Page Semantic Structures

Junlan Feng, Patrick Haffner, Mazin Gilbert

2005 (modified: 16 May 2023), ICDAR 2005, Readers: Everyone

Abstract: This paper proposes a learning approach for discovering the semantic structure of Web pages. The task consists of partitioning the text on a Web page into information blocks and identifying the semantic category of each block. We employed two machine learning techniques, AdaBoost and SVMs, to learn from a labeled Web page corpus. We evaluated our approach on general Web pages from the World Wide Web and obtained encouraging results. This work can benefit a number of Web-driven applications, such as search engines, Web-based question answering, Web-based data mining, and voice-enabled Web navigation.
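
As an illustration only: the abstract names AdaBoost as one of the two learners but does not describe the features or weak learners used. The sketch below is a minimal from-scratch AdaBoost with one-feature threshold stumps, applied to the block-categorization step. The two features (word count, link ratio), the two categories (navigation vs. content), and all function names are hypothetical and do not come from the paper. The SVM learner and the block-partitioning step are not shown.

```python
import math


def train_adaboost(X, y, rounds=5):
    """Train AdaBoost with one-feature threshold stumps; labels are +1 / -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []  # each entry: (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Exhaustively search for the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({row[f] for row in X}):
                for pol in (1, -1):
                    preds = [pol if row[f] >= thr else -pol for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, preds)
        err, f, thr, pol, preds = best
        # Clamp the error so the log below stays finite for a perfect stump.
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Raise the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * t * p)
                   for w, p, t in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Return +1 or -1 from the weighted vote of all stumps."""
    score = sum(a * (pol if row[f] >= thr else -pol)
                for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: (word_count, link_ratio) per text block.
# Label +1 = navigation block, -1 = content block (assumed categories).
X = [(4, 0.9), (6, 0.8), (5, 1.0), (120, 0.05), (80, 0.1), (200, 0.0)]
y = [1, 1, 1, -1, -1, -1]

model = train_adaboost(X, y)
train_predictions = [predict(model, row) for row in X]
```

On this toy data a single stump on word count already separates the two classes, so the boosting rounds repeat the same stump. A real labeled corpus would need richer features and many more rounds.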
