Abstract: Data extraction from web pages often involves either human intervention for training a wrapper or a reduced level of granularity in the information acquired. Even though the study of social media has drawn the attention of researchers, weblogs remain a part of the web that cannot be harvested efficiently. In this paper, we propose a fully automated approach in generating a wrapper for weblogs, which exploits web feeds for cheap labelling of weblog properties. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. Our evaluation shows that our approach is robust, accurate and efficient in handling different types of weblogs.
0 Replies
Loading