Random Selection Reveals Implicit Knowledge Consensus in Code Generation

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, Everyone, CC BY 4.0
TL;DR: Random sampling is a strong low-cost default for selecting training solutions in verifiable multi-solution code-generation datasets.
Abstract: Training large language models for code generation often involves selecting data from verifiable multi-solution pools, where each problem admits multiple correct implementations. Conventional studies on data selection suggest that complex selection strategies, such as diversity maximization or difficulty ranking, should outperform naive random sampling. In this work, we systematically evaluate within-problem solution selection strategies across different representation spaces, including continuous embeddings, discrete tokens, and syntactic structures, using various base language models. Contrary to that expectation, simple random sampling achieves consistently competitive performance across all models and exhibits greater cross-model stability than complex methods. We interpret these results through the lens of *implicit knowledge consensus*: verified solution pools may contain representative algorithmic patterns that random sampling can preserve. Our findings suggest that practitioners should treat random sampling as a low-cost default for verifiable code-generation fine-tuning and move to complex selectors when hard-tail or constraint-focused coverage is the target.
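As a rough illustration of the within-problem random selection the abstract refers to, the sketch below draws up to `k` verified solutions per problem uniformly at random, without replacement. It is not the authors' implementation: the function name `select_random`, the `pool` layout (problem ID mapped to a list of already-verified solutions), and the `k` and `seed` parameters are all assumptions made for this example.

```python
import random


def select_random(pool, k, seed=0):
    """Pick up to k solutions per problem, uniformly at random.

    `pool` is a hypothetical mapping from problem ID to a list of
    solutions that have already passed verification.
    """
    rng = random.Random(seed)  # local RNG so the selection is reproducible
    selected = {}
    for problem_id in sorted(pool):  # fixed iteration order for determinism
        solutions = pool[problem_id]
        n = min(k, len(solutions))  # small pools are kept whole
        selected[problem_id] = rng.sample(solutions, n)  # no replacement
    return selected


# Toy pool with placeholder solution strings (illustrative only).
pool = {
    "two_sum": ["sol_a", "sol_b", "sol_c", "sol_d"],
    "reverse_list": ["sol_x", "sol_y"],
}
subset = select_random(pool, k=3, seed=42)
for problem_id, solutions in subset.items():
    print(problem_id, len(solutions))
```

Seeding a local `random.Random` instance rather than the global generator keeps the selection repeatable across runs, which is useful when the same subset has to be compared across several base models.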
Lay Summary: Many programming tasks have more than one correct answer. We asked a simple question: when training an AI system to write code, is it better to carefully choose a small set of examples, or can we just pick examples at random? We tested several ways of choosing training examples on multiple AI systems. To our surprise, random choice was usually just as good as more complicated methods, and sometimes better. One reason may be that correct solutions often share the same important ideas, so random sampling already captures much of what the system needs to learn. In harder cases, or when we want to cover different kinds of solutions, more targeted selection can still help. Overall, our results suggest that random sampling is a strong, low-cost default for training code-writing systems.
Originally Submitted Supplementary Material: zip
Primary Area: Deep Learning->Large Language Models
Keywords: Data Selection, Supervised Fine-Tuning, Large Language Models
Originally Submitted PDF: pdf
Submission Number: 20813