Abstract: New web content is published constantly, and although protocols such as RSS can notify subscribers of new pages, they are not always implemented or actively maintained. A more reliable way to discover new content is to periodically re-crawl the target sites. Designing such “content discovery crawlers” has important applications, for example, in web search, digital assistants, business, humanitarian aid, and law enforcement. Existing approaches assume that each site of interest has a relatively small set of unknown “source pages” that, when refreshed, frequently provide hyperlinks to the majority of new content. The state of the art (SOTA) uses ideas from the multi-armed bandit literature to explore candidate sources while simultaneously exploiting known good sources. We observe, however, that the SOTA uses a sub-optimal algorithm for balancing exploration and exploitation. We trace this back to a mismatch between the space of actions that the SOTA algorithm models and the space of actions that the crawler must actually choose from. Our proposed approach, the Thompson crawler (named after the Thompson sampler that drives its refresh decisions), addresses this shortcoming by more faithfully modeling the action space. On a dataset of 4,070 source pages drawn from 53 news domains over a period of 7 weeks, we show that, on average, the Thompson crawler discovers 20% more new pages, finds pages 6 hours earlier, and uses 14 fewer refreshes per 100 pages discovered than the SOTA.
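To make the exploration/exploitation idea concrete, the following is a minimal sketch of generic Beta-Bernoulli Thompson sampling applied to refresh decisions. It is not the Thompson crawler's actual model, which the abstract does not specify. It assumes, purely for illustration, that each refresh of a source page either yields a new link or does not, and that the crawler refreshes one page at a time. The names `BetaBernoulliSampler`, `simulate`, and `true_rates` are hypothetical.

```python
import random


class BetaBernoulliSampler:
    """Generic Thompson sampling over candidate source pages (illustrative only)."""

    def __init__(self, source_urls, seed=None):
        self.rng = random.Random(seed)
        # Assumed model: a Beta(1, 1) prior on each page's chance of
        # yielding a new link when refreshed, stored as [alpha, beta].
        self.params = {url: [1.0, 1.0] for url in source_urls}

    def choose(self):
        # Draw one plausible success rate per page from its posterior,
        # then refresh the page with the highest draw.
        draws = {
            url: self.rng.betavariate(a, b)
            for url, (a, b) in self.params.items()
        }
        return max(draws, key=draws.get)

    def update(self, url, found_new_link):
        # Posterior update: a success increments alpha, a failure increments beta.
        if found_new_link:
            self.params[url][0] += 1.0
        else:
            self.params[url][1] += 1.0


def simulate(true_rates, n_refreshes=2000, seed=0):
    """Count how often each page is refreshed under hypothetical true rates."""
    rng = random.Random(seed)
    sampler = BetaBernoulliSampler(true_rates, seed=seed + 1)
    counts = {url: 0 for url in true_rates}
    for _ in range(n_refreshes):
        url = sampler.choose()
        counts[url] += 1
        sampler.update(url, rng.random() < true_rates[url])
    return counts
```

Under these assumptions, pages with little evidence still receive occasional refreshes because their posterior draws vary widely, while pages that repeatedly yield new links come to dominate the refresh budget.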