Surprise-Modulated Meta-Advantages in Reinforcement Learning: Towards Language-Neutral Post-Training for Code LLMs
Keywords: Reinforcement Learning, Groupwise Meta-Normalization, Multilingual Code Generation
Abstract: Large language models excel at code generation in mainstream languages such as Python and JavaScript, yet perform poorly in low-resource languages such as Fortran, OCaml, and R. We reframe this gap not as an inevitable consequence of data scarcity, but as a problem of learning efficiency. In this work, we present PolyCode, trained with a groupwise meta-normalized variant of Proximal Policy Optimization (PPO) that we refer to as GMPO. GMPO augments the standard PPO-clip objective with two components: (i) Cross-Group Meta-Normalization (CGMN), which reduces advantage variance by pooling meta-statistics across groups of similar prompts, and (ii) Surprise-Based Advantage Modulation (SBAM), which up-weights updates where the reward signal deviates from the model's own confidence. We keep evaluation language-neutral by judging input/output behavior only, with a binary reward r in {0, 1} for exact conformity, thereby avoiding the need to translate unit tests across languages. Empirically, PolyCode-4B consistently matches or substantially exceeds smaller baselines on our Ag-LiveCodeBench-X benchmark, with considerable improvements over WPLL for Fortran and OCaml. For standardized reporting, pass@1 is computed as a Monte Carlo estimate over multiple single-sample trials (20 independent single draws per prompt at T=0.2); no best-of-n selection or voting is used.
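To make the two GMPO components concrete, below is a minimal sketch of how groupwise meta-normalized, surprise-modulated advantages could be computed before a PPO-clip update. It is not the authors' implementation: the cluster assignments used for CGMN pooling, the pooled-standard-deviation rule, and the SBAM weighting of the form 1 + beta * |r - confidence| are assumptions made purely for illustration.

```python
# Illustrative sketch (assumed form, not the paper's code): groupwise
# meta-normalized advantages (CGMN) with surprise-based modulation (SBAM).
import numpy as np

def gmpo_advantages(rewards, group_ids, cluster_ids, confidences, beta=1.0, eps=1e-6):
    """Compute per-sample advantages for a PPO-clip update.

    rewards     : binary rewards r in {0, 1}, shape (N,)
    group_ids   : prompt (group) index for each sample, shape (N,)
    cluster_ids : similarity-cluster index for each prompt group, shape (G,)
    confidences : model's estimated success probability per sample, shape (N,)
    """
    rewards = np.asarray(rewards, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    n_groups = int(group_ids.max()) + 1

    # Per-group statistics: the usual groupwise baseline and spread.
    group_mean = np.array([rewards[group_ids == g].mean() for g in range(n_groups)])
    group_std = np.array([rewards[group_ids == g].std() for g in range(n_groups)])

    # CGMN (assumed form): pool the normalization statistic across groups whose
    # prompts fall in the same similarity cluster, damping per-group noise.
    pooled_std = np.empty(n_groups)
    for c in np.unique(cluster_ids):
        members = np.where(cluster_ids == c)[0]
        pooled_std[members] = group_std[members].mean()

    adv = (rewards - group_mean[group_ids]) / (pooled_std[group_ids] + eps)

    # SBAM (assumed form): up-weight samples whose reward deviates from the
    # model's own confidence, so "surprising" outcomes drive larger updates.
    surprise = np.abs(rewards - confidences)
    return adv * (1.0 + beta * surprise)

# Example: two prompts (groups) in the same similarity cluster, 3 samples each.
rewards = np.array([1, 0, 0, 1, 1, 0])
group_ids = np.array([0, 0, 0, 1, 1, 1])
cluster_ids = np.array([0, 0])            # both prompts belong to cluster 0
confidences = np.array([0.9, 0.8, 0.2, 0.3, 0.5, 0.6])
print(gmpo_advantages(rewards, group_ids, cluster_ids, confidences))
```

Under these assumptions, the modulated advantages then replace the standard advantages in the PPO-clip objective; everything else in the update remains unchanged.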
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 9894