Power-Flow: Unlocking LLMs with $\alpha$-Power Distribution Fine-Tuning

06 Nov 2025 (modified: 11 Nov 2025) · THU 2025 Fall AML Submission · CC BY 4.0
Keywords: GFlowNet, unsupervised fine-tuning of LLMs
Abstract: Fine-tuning Large Language Models (LLMs) with Reinforcement Learning (RL) effectively enhances their capabilities but typically relies on costly external reward signals. While recent self-rewarding methods offer an alternative, they often use heuristic rewards with unclear learning objectives. We posit that many advanced skills, such as reasoning and creativity, are already latent within the base model and can be activated by sampling from its $\alpha$-power distribution, $p_\alpha(x) \propto p_{\text{base}}(x)^\alpha$. However, existing sampling methods such as MCMC are inefficient at inference time. We propose a novel unsupervised fine-tuning framework using Generative Flow Networks (GFlowNets) to directly train a policy that samples from this target $\alpha$-power trajectory distribution. We define an intrinsic reward signal based on the trajectory density under the frozen base model itself. This principled approach provides a unified mechanism to controllably unlock latent abilities: setting $\alpha > 1$ enhances reasoning by "sharpening" the distribution, while $\alpha < 1$ unlocks creative diversity by "flattening" it. We plan to demonstrate the effectiveness of our method on reasoning and creative generation benchmarks.
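To make the training setup concrete, below is a minimal sketch of how a GFlowNet-style objective with an $\alpha$-power intrinsic reward could be instantiated for an autoregressive LM. It assumes the trajectory-balance loss with a deterministic backward policy (each complete sequence corresponds to a single trajectory), so the loss reduces to $(\log Z_\theta + \log p_\theta(x) - \alpha \log p_{\text{base}}(x))^2$. The model name ("gpt2"), prompt, $\alpha$, and hyperparameters are illustrative placeholders, not details from the submission.

```python
# Minimal sketch: GFlowNet trajectory-balance fine-tuning toward p_base(x)^alpha.
# Assumes Hugging Face transformers; "gpt2", the prompt, alpha, and lr are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2").to(device)        # trainable sampler p_theta
base = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()   # frozen base model p_base
for p in base.parameters():
    p.requires_grad_(False)

log_Z = torch.zeros(1, device=device, requires_grad=True)               # learned log-partition estimate
opt = torch.optim.Adam(list(policy.parameters()) + [log_Z], lr=1e-5)
alpha = 2.0                                                             # alpha > 1 sharpens, alpha < 1 flattens


def sequence_logprob(model, seq_ids, prompt_len):
    """Sum of token log-probabilities of the generated continuation under `model`."""
    logits = model(seq_ids).logits[:, :-1, :]                           # prefix up to t predicts token t+1
    logps = F.log_softmax(logits, dim=-1)
    targets = seq_ids[:, 1:]
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum(dim=-1)                  # score only the continuation


prompt = tok("Question: what is 17 * 24?\nAnswer:", return_tensors="pt").to(device)
prompt_len = prompt.input_ids.shape[1]

for step in range(100):
    # Sample a trajectory (completion) on-policy from the current sampler.
    with torch.no_grad():
        seq = policy.generate(**prompt, do_sample=True, max_new_tokens=32,
                              pad_token_id=tok.eos_token_id)

    log_p_theta = sequence_logprob(policy, seq, prompt_len)             # differentiable w.r.t. policy
    with torch.no_grad():
        # Intrinsic reward from the frozen base model: log R(x) = alpha * log p_base(x)
        log_p_base = sequence_logprob(base, seq, prompt_len)

    # Trajectory balance: (log Z + log p_theta(x) - alpha * log p_base(x))^2
    loss = (log_Z + log_p_theta - alpha * log_p_base).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At the optimum of this objective, $\log p_\theta(x) = \alpha \log p_{\text{base}}(x) - \log Z$, i.e. the policy samples from the normalized $\alpha$-power distribution without any MCMC at inference time; in practice, off-policy or tempered exploration is commonly added to stabilize GFlowNet training.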
Submission Number: 4