Advanced Deep Web Crawler Based on Dom

Weicheng Ma, Xiuxia Chen, Wenqian Shang

Published: 2012, Last Modified: 18 Jun 2024, CSO 2012, CC BY-SA 4.0

Abstract: A large amount of today's data is stored only in the deep web. A review of prior work on deep web crawlers makes it evident that no perfect, or even complete, crawler for deep web data has yet been built. To meet the needs of deep web search, we have designed a new crawler structure, which currently focuses on extracting data from forms, the most common type of deep web interface. Our crawler introduces several innovations, including a mainframe extracting module, an algorithm based on improved Bayesian classification that distinguishes different websites sharing the same URL, and an extension to handle AJAX forms. In addition, a DOM tree is used to make the analysis and processing of downloaded web pages easier and more visual.
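
To make the form-extraction step concrete, the following is a minimal sketch of how forms and their named fields might be pulled from a downloaded page. It is not the authors' implementation: the names `FormExtractor` and `extract_forms` are hypothetical, and it uses the event-based `html.parser` module from the Python standard library rather than building a full DOM tree. It handles static HTML forms only and does not attempt the AJAX form handling mentioned in the abstract.

```python
from html.parser import HTMLParser


class FormExtractor(HTMLParser):
    """Hypothetical helper: collect each <form> and the names of its fields."""

    FIELD_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per completed form
        self._current = None  # the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                # HTML forms default to GET when no method is given
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif tag in self.FIELD_TAGS and self._current is not None:
            name = attrs.get("name")
            if name:  # unnamed controls are not submitted, so skip them
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_forms(html):
    """Return a list of dicts describing the forms found in an HTML string."""
    parser = FormExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms
```

A crawler could call `extract_forms(page_source)` on each downloaded page and use the resulting `action`, `method`, and `fields` values to decide which queries to submit. A tree-based parser would be the closer match to the DOM approach the abstract describes.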