Abstract: This paper proposes a learning approach for discovering the semantic structure of Web pages. The task consists of partitioning the text on a Web page into information blocks and identifying their semantic categories. We employed two machine learning techniques, AdaBoost and SVMs, to learn from a labeled Web page corpus. We evaluated our approach on general Web pages from the World Wide Web and obtained encouraging results. This work can benefit a number of Web-driven applications, such as search engines, Web-based question answering, Web-based data mining, and voice-enabled Web navigation.