Abstract: Evaluation of machine learning models typically emphasizes final accuracy, overlooking the cost of adaptation: the cumulative errors incurred while learning from scratch. Guess-and-Learn (G&L) v1.0 addresses this gap by measuring cold-start adaptability: the total mistakes a model makes while sequentially labeling an unlabeled dataset. At each step, the learner selects an instance, predicts its label, receives the ground truth, and updates parameters under either online (per-sample) or batch (delayed) mode. The resulting error trajectory exposes adaptation speed, selection quality, and bias, dynamics invisible to endpoint metrics. G&L defines four tracks (Scratch/Pretrained $\times$ Online/Batch) to disentangle the effects of initialization and update frequency. We formalize the protocol, relate it to classical mistake-bound theory, and estimate a heuristic ``oracle reference band'' for MNIST as a plausibility reference. Baseline experiments on MNIST and AG~News, spanning classical methods (Perceptron, $k$-NN), convolutional architectures (CNN, ResNet-50), and pretrained transformers (ViT-B/16, BERT-base), reveal systematic differences in early-phase efficiency: smaller models can adapt with fewer initial errors, while the benefit of pretraining varies by domain. Across settings, current models remain well above the oracle band, highlighting an adaptability gap. By quantifying the mistake cost of early learning, G&L complements conventional benchmarks and provides a reproducible framework for developing learners that are not only accurate in the limit but also reliable from the first examples.
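The online G&L loop described in the abstract (select an instance, guess its label, receive the ground truth, update) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy linearly separable data, the uniform-random selection rule, and the perceptron learner are all stand-ins chosen for brevity; G&L also covers batch updates, other selection strategies, and other model families.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data standing in for MNIST / AG News: 200 points,
# binary labels in {-1, +1} from a random linear rule.
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)

w = np.zeros(d)              # "Scratch" track: no pretrained initialization
unlabeled = list(range(n))   # pool of still-unlabeled instances
mistakes = 0
trajectory = []              # cumulative-mistake curve (the G&L error trajectory)

while unlabeled:
    # Selection step: uniform random here; confidence-based rules also fit.
    i = unlabeled.pop(rng.integers(len(unlabeled)))
    # Guess before the label is revealed (break ties toward +1).
    guess = np.sign(X[i] @ w) or 1.0
    if guess != y[i]:
        mistakes += 1
        # Online mode: update immediately on the revealed ground truth
        # (classic perceptron mistake-driven update).
        w += y[i] * X[i]
    trajectory.append(mistakes)
```

The final value of `trajectory` is the quantity G&L scores: total mistakes over the full sequential pass, with the whole curve exposing early-phase adaptation speed rather than only endpoint accuracy.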
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
* **De-anonymization and header updates.** Updated the manuscript header to reflect acceptance in Transactions on Machine Learning Research (02/2026). Added author name, affiliation (Independent Researcher), contact email, and the OpenReview forum link.
* **Code release and reproducibility details.** Replaced the double-blind placeholder with the permanent public GitHub repository link and pinned commit hash (5653338). Added a corresponding footnote on the first page to ensure reproducibility.
* **Table 2 revision.** Updated Table 2 to report mean ± standard error for the Random baseline column (previously point estimates only), enabling clearer comparison with best-performing strategies.
* **Expanded discussion in Section 6.1.** Added a clarifying paragraph under “Agility can outweigh capacity” to more precisely interpret the capacity–agility findings. The revision frames early-phase inefficiency in high-capacity models as reflecting structural properties of current architectures and training regimes, and explicitly notes that rapid adaptation need not inherently trade off with representational capacity.
* **Acknowledgements.** Replaced the double-blind placeholder with a formal acknowledgement thanking the Action Editor and anonymous reviewers.
* **Minor corrections.** Addressed formatting and capitalization issues in the references (e.g., BERT, PCM, MNIST) and made small wording refinements for clarity.
*No experimental results, figures, or quantitative findings were altered.*
Code: https://github.com/RolandWArnold/guess-and-learn-benchmark
Supplementary Material: zip
Assigned Action Editor: ~Pierre_Ablin2
Submission Number: 6180