Extracting context to improve accuracy for HTML content extraction

2005 (modified: 12 Nov 2022)
WWW (Special interest tracks and posters) 2005
Abstract: Previous work on content extraction used various heuristics such as link-to-text ratio, prominence of tables, and identification of advertising. Many of these heuristics were associated with "settings", whereby some heuristics could be turned on or off and others parameterized by minimum or maximum threshold values. A given collection of settings, such as removing table cells with high linked-to-non-linked text ratios and removing all apparent advertising, might work very well for a news website but leave little or no content for the reader of a shopping site or a web portal. We present a new technique, based on incrementally clustering websites using search engine snippets, to associate a newly requested website with a particular "genre", and then employ settings previously determined to be appropriate for that genre, with dramatically improved content extraction results overall.
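To make the idea of genre-specific settings concrete, here is a minimal sketch of how one heuristic from the abstract (dropping table cells dominated by linked text) could be switched on or off, and thresholded, per genre. It is an illustration under stated assumptions, not the authors' implementation: the names `Cell`, `Settings`, `GENRE_SETTINGS`, `link_share`, and `filter_cells` are hypothetical, the threshold values are invented for the example, and the bounded share of linked characters stands in for the linked-to-non-linked ratio. The genre label is assumed to be supplied by an earlier step (the snippet-based clustering), which is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A table cell split into linked (anchor) text and plain text."""
    linked: str
    plain: str


@dataclass(frozen=True)
class Settings:
    """One hypothetical 'settings' bundle: an on/off switch plus a threshold."""
    remove_link_heavy_cells: bool
    max_link_share: float


# Hypothetical genre table; the values are illustrative, not from the paper.
GENRE_SETTINGS = {
    "news": Settings(remove_link_heavy_cells=True, max_link_share=0.5),
    "shopping": Settings(remove_link_heavy_cells=False, max_link_share=1.0),
}
DEFAULT_SETTINGS = Settings(remove_link_heavy_cells=False, max_link_share=1.0)


def link_share(cell: Cell) -> float:
    """Fraction of a cell's characters that sit inside links (0.0 if empty)."""
    total = len(cell.linked) + len(cell.plain)
    return len(cell.linked) / total if total else 0.0


def filter_cells(cells: list[Cell], genre: str) -> list[Cell]:
    """Apply the link-heavy-cell heuristic using the settings for `genre`."""
    settings = GENRE_SETTINGS.get(genre, DEFAULT_SETTINGS)
    if not settings.remove_link_heavy_cells:
        return list(cells)
    return [c for c in cells if link_share(c) <= settings.max_link_share]


cells = [
    Cell(linked="Home | World | Sports", plain=""),
    Cell(linked="read more", plain="A long article paragraph about the election."),
]
news_cells = filter_cells(cells, "news")
shopping_cells = filter_cells(cells, "shopping")
```

With these illustrative values, the navigation-only cell is removed for the "news" genre, while both cells survive for "shopping", where link-heavy cells may be the content the reader came for.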