PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

10 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM Alignment, RLHF, Post-Training, Preference Optimization
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs). However, its effectiveness critically depends on high-quality instruction data. Most existing high-quality alignment datasets are either private or require costly human annotation, which hinders reproducibility and scalability. Even with the emergence of Reinforcement Learning from AI Feedback (RLAIF), concerns about data quality remain. Moreover, it is still unclear to the open-source community how much data is actually required to fine-tune a base model into a strong instruction-following model. Current state-of-the-art approaches typically rely on over 300k examples even at the supervised fine-tuning (SFT) stage, yet they still underperform proprietary models, leaving substantial barriers for academic and resource-constrained communities. To address this gap, we introduce PiKa, a data-efficient family of expert-level alignment datasets. In particular, the PiKa-SFT dataset uses only 30k SFT examples, an order of magnitude fewer than the state-of-the-art dataset Magpie. Through extensive evaluations that fine-tune Llama-3-8B-Base on PiKa and other public instruction-following datasets, we show that PiKa-SFT alone outperforms models trained on much larger datasets. Remarkably, on two widely used alignment benchmarks, AlpacaEval 2.0 and Arena-Hard, PiKa-SFT fine-tuning surpasses the official Llama-3-8B-Instruct model, which was trained on over 10M proprietary examples. We further extend our study by training the Qwen2.5 series (0.5B–7B) on PiKa-SFT, consistently outperforming their official instruction-tuned counterparts. In addition, we curate 30k high-quality preference optimization examples, which further improve alignment performance when applied after SFT initialization.
These findings demonstrate that high-quality alignment can be achieved with significantly reduced data, providing a practical and scalable path for advancing open-source LLM alignment research. Our code and data will be available at https://anonymous.4open.science/r/PiKa.
Primary Area: datasets and benchmarks
Submission Number: 3777