Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning
Abstract: Large language models (LLMs), primarily built on decoder-only transformer architectures, excel at natural language generation and have shown promise in adapting to diverse downstream tasks through zero-shot and few-shot prompting. However, these prompting methods often fall short on natural language understanding (NLU) tasks, where smaller encoder-only models such as BERT-base consistently outperform LLMs on benchmarks like GLUE and SuperGLUE. In this paper, we explore two approaches, supervised fine-tuning and proximal policy optimization (PPO), to enhance the NLU capabilities of LLMs. To reduce the computational cost of full-model fine-tuning, we integrate low-rank adaptation (LoRA) layers and restrict updates to these layers during both the supervised fine-tuning and PPO stages. In the supervised fine-tuning approach, task-specific prompts are concatenated with input queries and ground-truth labels from the NLU training corpus, and the model is optimized with the next-token prediction objective. Despite this, LLMs still underperform encoder-only models such as BERT-base on several NLU tasks. To close this gap, we employ PPO, a reinforcement learning technique that treats each generated token as an action and scores the generated sequence with a reward function measuring its agreement with the ground-truth answer. PPO then updates the model to maximize these rewards, aligning its outputs with the correct labels. Our experiments with the LLAMA2-7B-chat-hf model demonstrate that PPO-based fine-tuning significantly improves performance, delivering an average gain of 6.3 points over supervised fine-tuning on the GLUE benchmark. PPO surpasses zero-shot prompting by 38.7 points and few-shot prompting by 26.1 points on GLUE, and outperforms these baselines by 28.8 and 28.5 points, respectively, on SuperGLUE. Additionally, PPO exceeds the performance of BERT-large, a strong baseline, by an average of 2.7 points on GLUE and 9.3 points on SuperGLUE. These improvements hold across models such as Qwen2.5-7B-Instruct and MPT-7B-chat, highlighting PPO's robustness and effectiveness in improving the NLU capabilities of LLMs. Furthermore, LLAMA2-7B-chat-hf and LLAMA2-13B-chat-hf models fine-tuned with PPO on a single dataset exhibit strong zero-shot generalization to diverse unseen datasets. On average, they outperform GPT-4o by over 4% on sentiment analysis and natural language inference tasks, with notable gains of 7.3% on the Mental Health dataset and more than 10.9% on SIGA-nli.
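A minimal sketch of the PPO stage described in the abstract is given below, assuming the Hugging Face TRL (PPOTrainer-era API, roughly v0.7) and PEFT libraries: a LoRA-wrapped LLAMA2-7B-chat-hf policy generates an answer to an NLU prompt, receives a scalar reward reflecting agreement with the gold label, and is updated with a single PPO step. The prompt template, the +1/-1 reward shaping, and all hyperparameters here are illustrative assumptions, not the authors' reported configuration.

```python
import torch
from transformers import AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "meta-llama/Llama-2-7b-chat-hf"

# LoRA adapters are the only trainable parameters; the base weights stay frozen.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         bias="none", task_type="CAUSAL_LM")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Policy with a value head for PPO; with a PEFT model, TRL uses the adapter-disabled
# base model as the frozen reference, so no separate ref_model is needed.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, peft_config=lora_config)

ppo_config = PPOConfig(batch_size=1, mini_batch_size=1, learning_rate=1.4e-5)
ppo_trainer = PPOTrainer(ppo_config, model, ref_model=None, tokenizer=tokenizer)

# Toy SST-2-style example; real training iterates over the full NLU training split.
prompts = ["Classify the sentiment of the review as positive or negative.\n"
           "Review: a gripping, beautifully shot film.\nAnswer:"]
gold_labels = ["positive"]

generation_kwargs = {"max_new_tokens": 4, "do_sample": True, "top_k": 0,
                     "top_p": 1.0, "pad_token_id": tokenizer.eos_token_id}

query_tensors, response_tensors, rewards = [], [], []
for prompt, gold in zip(prompts, gold_labels):
    query = tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0)
    output = ppo_trainer.generate(query, **generation_kwargs)
    response = output.squeeze(0)[query.shape[0]:]   # keep only the generated tokens
    answer = tokenizer.decode(response, skip_special_tokens=True)

    # Assumed reward shaping: +1 if the generation matches the gold label, -1 otherwise.
    reward = torch.tensor(1.0 if gold.lower() in answer.lower() else -1.0)

    query_tensors.append(query)
    response_tensors.append(response)
    rewards.append(reward)

# One PPO update: maximizes reward while a KL penalty keeps the policy near the reference.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Restricting gradients to the LoRA adapters keeps each PPO update cheap, while TRL's built-in KL penalty against the frozen reference model discourages the policy from drifting away from fluent generation while it learns to emit the correct labels.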
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu_Meng1
Submission Number: 3929