Abstract: Recent efforts have extended the capabilities of transformers in logical reasoning and symbolic computation. In this work, we investigate their capacity for non-linear latent pattern discovery in the context of functional decomposition, focusing on the challenging algebraic task of multivariate polynomial decomposition. This problem, which has widespread applications in science and engineering, has been proved NP-hard and demands both precision and insight. Our contributions are threefold. First, we develop a synthetic data generation pipeline that provides fine-grained control over problem complexity. Second, we train transformer models via supervised learning and evaluate them across four key dimensions involving scaling behavior and generalizability. Third, we propose Beam Grouped Relative Policy Optimization (BGRPO), a rank-aware reinforcement learning method suitable for hard algebraic problems. Fine-tuning with rank-aware BGRPO lifts beam-search accuracy by 34--37 percentage points over the SFT initialization and by 1.7--3.7 points over vanilla GRPO, with non-overlapping $\pm 1\sigma$ bands at every scale. After RL, even greedy decoding surpasses the SFT model's best beam-search score by 23.8--25.8 percentage points at every scale. Additionally, our model demonstrates competitive performance in polynomial simplification, outperforming Mathematica in various cases.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Since the last submission we re-trained BGRPO from the SFT initializations, added vanilla GRPO baselines across 3 random seeds, evaluated all models on a 200-problem multi-variable test set, and corrected an implementation issue: the advantage was being standard-deviation normalized rather than mean-baselined as the paper specified. With the corrected code, RL gains are substantially larger than originally reported: rank-aware BGRPO lifts beam@30 accuracy by 34–37 percentage points over the SFT initialization and by 1.7–3.7 points over vanilla GRPO, with non-overlapping ±1σ bands at every scale, and greedy decoding after BGRPO already surpasses the SFT model's best beam-search score by 23.8–25.8 points. We retract the original "~75% lower inference compute / 50% smaller beam width" claim, which rested on an incorrect quadratic-in-beam-width cost model (under the correct linear scaling, the same beam-width reduction gives ~50% lower compute), and reframe BGRPO as primarily an accuracy-improvement method rather than a decoding-efficiency one. New components include a four-way comparison (SFT, vanilla GRPO, BGRPO with a binary reward, BGRPO with the rank-aware reward) that replaces the previous Figure 8, a decay-base ablation showing that the rank reward is highly sensitive to its only hyperparameter (peak-to-trough ~10 accuracy points, with the peak at b=20, near the beam width), and a greedy-vs-SFT-beam comparison table. We also corrected the training objective in §2.3 to the per-token surrogate that matches the implementation, introduced the decay base $b$ as a hyperparameter separate from the beam width $w$, unified terminology across sections, and added a test-set ceiling caveat to the limitations (the three rank-aware models cluster to within one point at ~55% beam@30 on the 200-problem set).
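Two of the corrections above come down to short arithmetic, which the following minimal Python sketch illustrates. It is not the authors' implementation: the function names, the example reward vector, and the `eps` stabilizer are assumptions made for illustration, and the cost model is assumed to be a pure power law in beam width.

```python
from statistics import mean, pstdev


def mean_baselined(rewards):
    """Advantage as reward minus the group mean (the form the paper specifies)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def std_normalized(rewards, eps=1e-8):
    """Advantage additionally divided by the group standard deviation.

    `eps` is a hypothetical stabilizer to avoid division by zero.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


def compute_saving(width_ratio, exponent):
    """Fraction of compute saved when beam width is scaled by `width_ratio`,
    assuming cost is proportional to width ** exponent."""
    return 1.0 - width_ratio ** exponent


# Hypothetical group of four sampled outputs, one of them correct.
rewards = [1.0, 0.0, 0.0, 0.0]
print(mean_baselined(rewards))   # [0.75, -0.25, -0.25, -0.25]
print(std_normalized(rewards))   # same signs, rescaled by the group spread

# Halving the beam width under the two cost models.
print(compute_saving(0.5, 2))    # 0.75 (quadratic model, the retracted claim)
print(compute_saving(0.5, 1))    # 0.5  (linear model, the corrected figure)
```

Both advantage variants agree in sign and differ only in scale, so the sketch shows what changed between the two code paths, not why the change affected accuracy.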
Assigned Action Editor: ~Martin_Mundt1
Submission Number: 7755