End-to-End QA Construction Pipeline for Continual Pre-training of Large Language Models

12 Sept 2025 (modified: 14 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, SimpleQA, Continual Pre-training, Pre-training, QA data construction, Data Construction Pipeline
Abstract: As Large Language Models (LLMs) evolve into proficient AI assistants, the demand for high-quality data becomes increasingly critical. Existing methods for creating question-answer (QA) datasets often depend on limited self-generated data from LLMs or labor-intensive manual annotation, which restricts both the scope and size of the resulting datasets. To overcome these limitations, we propose a comprehensive pipeline for acquiring and filtering high-quality QA data from web searches, leveraging the vast and diverse content available online. Our approach includes training a High-Quality Knowledge Model, which ensures dataset robustness by filtering queries according to clarity and static-knowledge criteria. Additionally, we introduce a Knowledge Boundary Model that identifies knowledge gaps within LLMs, enhancing their ability to handle novel scenarios effectively. Our approach not only produces an extensive QA dataset but also provides training strategies that strengthen LLM capabilities. Our method improves over the baseline by 22.96% on Chinese SimpleQA, 4.66% on SimpleQA, 4.78% on seven single-hop datasets, and 17.47% on eight multi-hop datasets. Our code and data will be released.
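The abstract describes a two-stage filtering pipeline (web retrieval and QA generation, followed by a High-Quality Knowledge filter and a Knowledge Boundary filter). Below is a minimal, hypothetical sketch of how such a pipeline could be wired together; the function names, parameters, and stub components are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    question: str
    answer: str
    source_url: str


def build_qa_dataset(
    queries: Iterable[str],
    search_web: Callable[[str], List[str]],        # query -> retrieved web documents
    generate_qa: Callable[[str], List[QAPair]],    # document -> candidate QA pairs
    is_clear_and_static: Callable[[str], bool],    # High-Quality Knowledge filter (assumed interface)
    is_beyond_boundary: Callable[[QAPair], bool],  # Knowledge Boundary filter (assumed interface)
) -> List[QAPair]:
    """Collect QA pairs from web search results, keeping only those that are
    clear, grounded in static knowledge, and outside the target LLM's
    existing knowledge boundary."""
    dataset: List[QAPair] = []
    for query in queries:
        for doc in search_web(query):
            for qa in generate_qa(doc):
                if not is_clear_and_static(qa.question):
                    continue  # drop ambiguous or time-sensitive questions
                if not is_beyond_boundary(qa):
                    continue  # drop questions the target LLM already answers correctly
                dataset.append(qa)
    return dataset


# Toy usage with stub components standing in for real web search, QA
# generation, and the two filter models described in the abstract.
if __name__ == "__main__":
    stub_search = lambda q: [f"Document about {q}"]
    stub_generate = lambda doc: [
        QAPair(f"What is {doc[15:]}?", "An example answer.", "https://example.com")
    ]
    data = build_qa_dataset(
        queries=["continual pre-training"],
        search_web=stub_search,
        generate_qa=stub_generate,
        is_clear_and_static=lambda question: True,
        is_beyond_boundary=lambda qa: True,
    )
    print(data)
```

In this sketch the two filters are passed in as callables so that classifier models of any kind can back them; the paper's actual model interfaces may differ.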
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4404