HQWiki: A Pipeline for High-Quality Monolingual Datasets

Published: 05 Feb 2025, Last Modified: 23 Apr 2025. WD&R Poster. License: CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Authors Biographies: Our related works: https://aclanthology.org/2024.iwclul-1.11/ and https://arxiv.org/abs/2410.12989
Keywords: monolingual datasets, dataset preparation, data quality, language identification
TL;DR: Creating clean monolingual datasets is hard due to mixed languages and noise. We built HQWiki, a pipeline that automatically filters and validates content to ensure high quality.
Abstract: Creating high-quality monolingual datasets poses significant challenges due to cross-language contamination, metadata noise, and inconsistent quality control, especially for less-resourced languages. We present HQWiki, a pipeline for extracting and validating clean monolingual content. While initially developed for Wikipedia data, our framework introduces universal approaches to dataset preparation that can be applied to various content sources. The pipeline implements a novel multi-stage approach combining character set validation, context-aware filtering, and adaptive language detection. Each processing stage addresses specific quality aspects through configurable filters: structural analysis for metadata removal, language boundary detection, and content validation. The system supports the integration of multiple language identification tools with customisable confidence thresholds, making it particularly effective for languages where standard tools provide insufficient accuracy. Evaluation across multiple language families demonstrated the effectiveness of our approach, with processed datasets showing a significant reduction in noise compared to raw Wikipedia exports. The framework maintained high accuracy in language identification even for closely related languages. Importantly, the modular architecture allows researchers to adapt filtering criteria and extend the pipeline for specific research requirements. The toolkit provides a foundation for a systematic dataset-creation methodology that can be applied across different language families and research purposes. By open-sourcing both our framework and the resulting monolingual datasets, we enable researchers to create and maintain their own high-quality datasets while ensuring transparency and reproducibility in the data preparation process.
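To illustrate the kind of staged filtering the abstract describes (structural metadata removal, character set validation, and language identification with a confidence threshold), here is a minimal sketch in Python. All function names, patterns, and thresholds are assumptions for illustration only, not the HQWiki API; a real language-identification tool (e.g. a fastText or CLD3 model) would be plugged in via the `detect_language` callable.

```python
import re
import unicodedata
from typing import Callable, Iterable, List, Tuple

# Hypothetical illustration of a multi-stage monolingual filter;
# names, regexes, and thresholds are assumptions, not the HQWiki implementation.

METADATA_PATTERNS = [
    re.compile(r"^\s*\{\{.*\}\}\s*$"),        # wiki template lines
    re.compile(r"^\s*\[\[(File|Category):"),  # file/category links
    re.compile(r"^\s*==.*==\s*$"),            # section headings
]

def looks_like_metadata(line: str) -> bool:
    """Structural filter: drop lines that resemble wiki markup rather than prose."""
    return any(p.search(line) for p in METADATA_PATTERNS)

def charset_ratio(text: str, allowed_scripts: Tuple[str, ...]) -> float:
    """Fraction of alphabetic characters belonging to one of the allowed scripts."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    ok = sum(
        1 for c in letters
        if any(script in unicodedata.name(c, "") for script in allowed_scripts)
    )
    return ok / len(letters)

def filter_lines(
    lines: Iterable[str],
    detect_language: Callable[[str], Tuple[str, float]],  # returns (lang_code, confidence)
    target_lang: str,
    allowed_scripts: Tuple[str, ...] = ("CYRILLIC",),
    min_charset_ratio: float = 0.9,
    min_confidence: float = 0.8,
) -> List[str]:
    """Keep only lines that pass structural, charset, and language-ID checks."""
    kept = []
    for line in lines:
        if looks_like_metadata(line):
            continue
        if charset_ratio(line, allowed_scripts) < min_charset_ratio:
            continue
        lang, conf = detect_language(line)
        if lang == target_lang and conf >= min_confidence:
            kept.append(line)
    return kept
```

In this sketch the confidence threshold and allowed scripts are explicit parameters, mirroring the paper's emphasis on configurable filters that can be tuned per language; the actual pipeline's stages, defaults, and interfaces may differ.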
Format: Paper (20 minutes presentation)
Submission Number: 32
