To alleviate text degeneration in large-scale language models and meet the requirements of real-world applications, it is essential to make generation more controllable. Previous reinforcement learning (RL) research on language modeling generally learns from sentence-level feedback, which requires extensive exploration to collect enough trajectories and additional training steps to identify the contributory components within a noisy trajectory corpus. To address this, we propose a novel reinforcement learning algorithm with FIne-grained REward (FIRE). We derive an extensible fine-grained reward function and ease the trade-off between reward approximation and training stability. We present a theoretical connection between our approach and canonical policy-gradient RL methods. Experimental results show that FIRE achieves superior controllability of language models with lower computational overhead than prior RL approaches.
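To make the contrast between sentence-level and fine-grained feedback concrete, the following is a minimal, hypothetical sketch of a REINFORCE-style policy-gradient loss under each setting. It is not the paper's FIRE algorithm or reward derivation; the per-token rewards and the reward-to-go credit assignment are illustrative assumptions only.

```python
# Illustrative sketch only: sentence-level vs. token-level ("fine-grained")
# REINFORCE-style losses. The reward values below are hypothetical placeholders.
import torch


def sentence_level_loss(log_probs: torch.Tensor, sentence_reward: torch.Tensor) -> torch.Tensor:
    """One scalar reward for the whole trajectory credits every token equally.

    log_probs: (T,) log-probabilities of the sampled tokens.
    """
    return -(sentence_reward * log_probs.sum())


def token_level_loss(log_probs: torch.Tensor, token_rewards: torch.Tensor) -> torch.Tensor:
    """Per-token rewards let contributory tokens be credited individually.

    token_rewards: (T,) fine-grained rewards, e.g. from an attribute scorer.
    Uses reward-to-go R_t = r_t + r_{t+1} + ... + r_T as the per-token weight.
    """
    returns = torch.flip(torch.cumsum(torch.flip(token_rewards, dims=[0]), dim=0), dims=[0])
    return -(returns * log_probs).sum()


# Toy example with a fixed 4-token sampled sequence.
log_probs = torch.log(torch.tensor([0.4, 0.3, 0.5, 0.2]))
sentence_reward = torch.tensor(1.0)                   # single trajectory-level score
token_rewards = torch.tensor([0.0, 0.0, 1.0, 0.0])    # only the third token is "contributory"

print(sentence_level_loss(log_probs, sentence_reward))
print(token_level_loss(log_probs, token_rewards))
```

In the sentence-level case the single reward is spread over all tokens, so many sampled trajectories are needed to separate helpful tokens from noise; the token-level weighting localizes the learning signal, which is the intuition behind using a fine-grained reward.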