FormulaSPIN: Self-Play Fine-Tuning for Natural Language to Spreadsheet Formula Generation

ACL ARR 2026 January Submission875 Authors

25 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: self-play fine-tuning, natural language to formula, spreadsheet formula generation, execution feedback, curriculum learning, test-time scaling, consensus polling, NL2Formula, preference optimization
Abstract: Spreadsheet applications are used by hundreds of millions of people worldwide, yet writing formulas remains a significant barrier. Despite recent progress in Natural Language to Formula (NL2Formula), existing approaches rely on static supervised data and quickly saturate on limited annotations. In this paper, we introduce FORMULASPIN, a self-play framework that breaks the ceiling of supervised fine-tuning by enabling iterative self-improvement beyond the constraints of annotated data, exploiting a unique advantage of formula generation: binary executability provides implicit supervision that distinguishes semantic errors from valid alternatives. Our method frames training as a two-player game in which the main player learns to prefer ground-truth formulas over those generated by its previous version, while execution feedback categorizes outputs into distinct granularities—enabling an adaptive curriculum that shifts from semantic correctness to stylistic refinement. To further scale test-time compute, we incorporate a semantic-level consensus polling mechanism that naturally handles multiple valid formulations. Experiments on multiple benchmarks demonstrate that FORMULASPIN achieves state-of-the-art performance, with 78.4% exact match and 84.2% execution accuracy on NL2FORMULA, matching models trained with 60K additional preference annotations while outperforming both traditional SFT (by 15.3%) and agent-like prompting methods leveraging GPT-4. These findings underscore self-play's potential for data-scarce tasks and open the door to extending it beyond executable domains.
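The semantic-level consensus polling the abstract describes can be sketched as follows: sample several candidate formulas, execute each against the spreadsheet, cluster candidates by their execution result (so distinct but semantically equivalent formulations vote together), and return a representative of the largest cluster. Everything below is an illustrative sketch, not the paper's implementation; the `execute` callback, the toy two-cell sheet, and the `SUM`-rewriting executor are all hypothetical stand-ins.

```python
from collections import defaultdict

def consensus_poll(candidates, execute):
    """Semantic-level consensus polling (sketch).

    `candidates`: formula strings sampled from the model.
    `execute`: maps a formula to its evaluated result, or None if it
    fails to run (the binary-executability filter). Formulas producing
    the same result are treated as semantically equivalent.
    """
    groups = defaultdict(list)
    for formula in candidates:
        result = execute(formula)
        if result is not None:  # drop non-executable candidates
            groups[result].append(formula)
    if not groups:
        return None
    # Representative of the largest semantic cluster wins the poll.
    return max(groups.values(), key=len)[0]

# Hypothetical toy executor over a fixed sheet with A1=2, A2=3.
def toy_execute(formula):
    sheet = {"A1": 2, "A2": 3}
    try:
        expr = formula.lstrip("=").replace("SUM(A1,A2)", "A1+A2")
        return eval(expr, {}, sheet)
    except Exception:
        return None

# "=A1+A2" and "=SUM(A1,A2)" both evaluate to 5, so they out-vote
# the semantically different "=A1*A2" and the broken "=BAD(".
print(consensus_poll(["=A1+A2", "=SUM(A1,A2)", "=A1*A2", "=BAD("],
                     toy_execute))
```

The key design choice this illustrates is that votes are counted over execution results rather than formula strings, so multiple valid formulations reinforce each other instead of splitting the vote.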
Paper Type: Long
Research Area: Hierarchical Structure Prediction, Syntax, and Parsing
Research Area Keywords: self-supervised learning, reinforcement learning, structured prediction, optimization methods, generalization, transfer learning / domain adaptation
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 875