Text Extraction from the Web via Text-to-Tag Ratio

Published: 01 Jan 2008, Last Modified: 30 Jan 2025, Venue: DEXA Workshops 2008, License: CC BY-SA 4.0
Abstract: We describe a method to extract content text from diverse Web pages by using the HTML document's text-to-tag ratio rather than specific HTML cues that may not be constant across various Web pages. We describe how to compute the text-to-tag ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach, we show surprisingly high levels of recall at all levels of precision, along with large space savings.
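As a rough illustration of the line-by-line idea in the abstract, the following Python sketch computes a text-to-tag ratio for each line and keeps the lines that score above a cutoff. Several details are assumptions made for this example and are not taken from the paper: the regex-based tag matcher, treating a line with no tags as having a denominator of 1, and using the mean ratio as a simple threshold in place of whatever clustering the authors use. The names `text_to_tag_ratio` and `extract_content` are hypothetical.

```python
import re

# Assumption: a simple regex is good enough to spot tags for illustration.
# It will misfire on edge cases such as ">" inside attribute values.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Return non-tag character count divided by tag count for one line."""
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    # Assumption: a line without tags uses a denominator of 1.
    return text_length / max(tag_count, 1)


def extract_content(html: str) -> list[str]:
    """Keep the text of lines whose ratio exceeds the mean ratio.

    The mean-ratio cutoff is a stand-in for a real clustering step.
    """
    lines = [ln for ln in html.splitlines() if ln.strip()]
    if not lines:
        return []
    ratios = [text_to_tag_ratio(ln) for ln in lines]
    threshold = sum(ratios) / len(ratios)
    return [
        TAG_RE.sub("", ln).strip()
        for ln, ratio in zip(lines, ratios)
        if ratio > threshold
    ]


if __name__ == "__main__":
    sample = "\n".join([
        '<div><a href="/">Home</a><a href="/about">About</a></div>',
        "<p>This is a long paragraph of article text that carries the actual content of the page.</p>",
        "<div><span>Ad</span></div>",
    ])
    for text in extract_content(sample):
        print(text)
```

In this toy input, the navigation and advertisement lines are dense with tags and carry little text, so their ratios fall below the cutoff, while the paragraph line has few tags and much text. Real pages would likely need smoothing across neighbouring lines and a more careful clustering step than a single global threshold.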