Abstract: We describe a method for extracting content text from diverse Web pages using the HTML document's text-to-tag ratio rather than specific HTML cues, which may not be consistent across Web pages. We describe how to compute the text-to-tag ratio line by line and then cluster the results into content and non-content areas. With this approach, we show surprisingly high levels of recall at all levels of precision, along with large space savings.
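The line-by-line computation can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not state: the ratio is taken as the number of non-tag characters on a line divided by the number of tags on that line, a line with no tags is treated as having one tag to avoid division by zero, and a plain mean threshold stands in for the clustering step. The names `text_to_tag_ratios` and `split_content` are hypothetical and chosen only for illustration.

```python
import re
from statistics import mean

# Assumption: anything between "<" and ">" counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of the HTML source."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text = TAG_RE.sub("", line).strip()
        # Assumption: a line without tags is divided by 1, not 0.
        ratios.append(len(text) / max(tag_count, 1))
    return ratios


def split_content(html: str) -> list[str]:
    """Keep the text of lines whose ratio exceeds the mean ratio.

    The mean threshold is a simple stand-in for the clustering step,
    not the clustering method evaluated in the paper.
    """
    lines = html.splitlines()
    ratios = text_to_tag_ratios(html)
    if not ratios:
        return []
    threshold = mean(ratios)
    return [
        TAG_RE.sub("", line).strip()
        for line, ratio in zip(lines, ratios)
        if ratio > threshold
    ]
```

In this sketch, navigation and widget lines, which pack many tags around little text, receive low ratios, while paragraph lines receive high ratios and are kept as content.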