Garbage in, garbage out: An analysis of HTML text extractors and their impact on NLP performance

Vlad Cristian Dumitru, Denis Iorga, Stefan Ruseti, Mihai Dascalu

Published: 2023, Last Modified: 19 Jan 2024 | CSCS 2023 | Readers: Everyone

Abstract: Technological advancement has significantly facilitated research and development in Artificial Intelligence, particularly in Natural Language Processing (NLP). High-quality data is crucial to success in this area, all the more so given the recent widespread adoption of large language models trained on large amounts of text from the Internet. This research examines one aspect of data quality in NLP: the impact of automated HTML text extraction techniques on the performance of specific NLP tasks. To this end, we conducted an empirical evaluation of several automated HTML text extraction techniques on 300 news articles written in English, Romanian, and French. The evaluation compared the output of the most popular automated text extraction tools (i.e., “boiler”, “justext”, “newspaper”, “readability”, and “trafilatura”) against human-validated texts. Both the automatically extracted and the human-validated texts were subjected to three NLP tasks: named entity recognition, sentiment analysis, and text summarization. Our analysis of the NLP results indicates that text from Romanian online news articles should be extracted with “newspaper”, whereas “trafilatura” should be used for English and French articles, regardless of the NLP task. Overall, our study characterizes the performance of the selected tools for extracting the text of online news articles, by language and NLP task.
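
The comparison of extractor output against human-validated text can be illustrated with a minimal sketch. The abstract does not state which metric the study used, so the bag-of-words token F1 below is an assumption chosen for illustration, and the helper names (`tokenize`, `token_f1`, `rank_extractors`) are hypothetical. The extractor outputs are passed in as plain strings, so no particular extraction library API is assumed.

```python
from __future__ import annotations

import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercased word tokens; \w is Unicode-aware for str patterns,
    # so Romanian and French diacritics are kept inside tokens.
    return re.findall(r"\w+", text.lower())


def token_f1(extracted: str, reference: str) -> float:
    # Bag-of-words F1 between an extractor's output and the
    # human-validated text (an assumed metric, not the paper's).
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(reference))
    overlap = sum((ext & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ext.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rank_extractors(outputs: dict[str, str], reference: str) -> list[tuple[str, float]]:
    # outputs maps an extractor name to the text it produced for one
    # article; the result is sorted from best to worst score.
    scores = {name: token_f1(text, reference) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under this sketch, leftover boilerplate such as menu text lowers precision, while dropped article paragraphs lower recall. Averaging the per-article scores separately for each language would give a per-language ranking of the kind the abstract reports, though the study itself ranks extractors by downstream NLP task results rather than by text overlap.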
