A Data-Scaling Sweet Spot in Structured Algorithmic Learning

Published: 29 May 2026, Last Modified: 01 Jun 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: data scaling, learning dynamics, grokking, memorization, structured-output learning, transformers
Abstract: Data scaling is usually framed through an onset question: how much data is needed before a model generalizes? We study a different, update-based scaling question: after generalization is possible, which dataset size reaches a fixed validation exact-match threshold in the fewest optimizer updates under a fixed training protocol? In Needleman–Wunsch (NW) matrix generation, a structured-output dynamic-programming task, small Transformers reach high validation exact-match accuracy in the fewest optimizer updates at an intermediate dataset size, not at the largest one. Past this sweet spot, larger datasets still generalize but require more gradient updates. A random-suffix probe strengthens the interpretation: when an unstructured random suffix is appended to each NW target, the rule-governed NW component reaches high training accuracy earlier than the arbitrary suffix, and this structured–random gap widens in the same regimes where validation improves. Together with the observation that, near the onset of weak validation competence, increasing data can reduce the updates needed to reach high training exact-match accuracy, this shows that data scaling changes both rule discovery and residual exact fitting. The result separates the critical data size for making generalization attainable from the update-optimal data size for reaching a fixed validation exact-match threshold.
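For readers unfamiliar with the task, the following is a minimal sketch of what an NW target might look like, assuming the model is asked to emit the full dynamic-programming score matrix for a pair of sequences. The scoring scheme (match +1, mismatch -1, gap -1), the function names `nw_matrix` and `exact_match`, and the all-or-nothing comparison are illustrative assumptions, not details taken from the paper.

```python
def nw_matrix(a: str, b: str, match: int = 1, mismatch: int = -1,
              gap: int = -1) -> list[list[int]]:
    """Fill the Needleman-Wunsch global-alignment score matrix.

    The scoring values are assumptions chosen for illustration only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each interior cell depends only on its three already-filled neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    return score


def exact_match(pred: list[list[int]], target: list[list[int]]) -> bool:
    """Hypothetical exact-match criterion: every cell must be correct."""
    return pred == target


if __name__ == "__main__":
    for row in nw_matrix("GA", "GA"):
        print(row)
```

Under an all-or-nothing criterion like this, a single wrong cell makes the whole output count as incorrect, which is why exact-match accuracy is a demanding threshold for a structured-output task.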
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 79