Web user profiling using data redundancy

Published: 2016, Last Modified: 27 Sept 2024, ASONAM 2016, CC BY-SA 4.0
Abstract: The study of Web user profiling dates back 30 years, with the goal of extracting “semantic”-based user profile attributes from the unstructured Web. Despite slight differences among existing approaches, the general method is to first identify the pages relevant to a specific user and then use machine learning models (e.g., CRFs) to extract the profile attributes from those pages. However, with the rapid growth of the Web, this two-step method suffers from data redundancy and from error propagation between the two steps. In this paper, we revisit the problem of Web user profiling in the big data era and address these new challenges. We propose a simple but effective approach for extracting user profile attributes from the Web using big data. To avoid error propagation, the approach handles all extraction subtasks in one unified model. To further incorporate human knowledge and improve extraction performance, we propose a Markov logic factor graph (MagicFG) model, which expresses human knowledge as first-order logic formulas and combines them into the extraction model. Our experiments on a real data set show that the proposed method significantly improves extraction performance (+4-6%; p ≪ 0.01, t-test) over several baseline methods.
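To make the idea of combining logic rules with an extraction model more concrete, the following is a minimal sketch in the general style of Markov logic, where each satisfied rule contributes its weight to a candidate's log-score. It is not the MagicFG model itself: the rules, weights, user record, and candidate e-mail addresses are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical hand-written rules (assumptions, not taken from the paper).
# Each rule maps (user, candidate_email) -> bool.
def rule_name_in_local_part(user, email):
    """True if the user's last name appears before the '@'."""
    local = email.split("@")[0].lower()
    return user["last_name"].lower() in local

def rule_domain_matches_affiliation(user, email):
    """True if the e-mail domain ends with the affiliation's domain."""
    domain = email.split("@")[1].lower()
    return domain.endswith(user["affiliation_domain"].lower())

# Illustrative weights; in a real model these would be learned from data.
WEIGHTED_RULES = [
    (1.5, rule_name_in_local_part),
    (1.0, rule_domain_matches_affiliation),
]

def score_candidates(user, candidates):
    """Log-linear scoring: p(c) is proportional to exp(sum of weights of satisfied rules)."""
    raw = {
        c: math.exp(sum(w for w, rule in WEIGHTED_RULES if rule(user, c)))
        for c in candidates
    }
    z = sum(raw.values())  # normalization constant over the candidate set
    return {c: v / z for c, v in raw.items()}

# Hypothetical user and candidate values gathered from several pages.
user = {"last_name": "Smith", "affiliation_domain": "example.edu"}
candidates = [
    "jsmith@example.edu",
    "webmaster@example.edu",
    "smith@mail.example.com",
]

probs = score_candidates(user, candidates)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

In this toy setting the candidate that satisfies both rules receives the highest probability. A full factor graph would additionally couple the attributes with one another and with page-level features, which this sketch leaves out.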