Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning
Abstract: Large language models (LLMs), primarily built on decoder-only transformer architectures, excel at natural language generation and have shown promise in adapting to diverse downstream tasks through zero-shot and few-shot prompting. However, these prompting methods often fall short on natural language understanding (NLU) tasks, where smaller encoder-only models such as BERT-base consistently outperform LLMs on benchmarks like GLUE and SuperGLUE. In this paper, we explore two approaches, supervised fine-tuning and proximal policy optimization (PPO), to enhance the NLU capabilities of LLMs. To reduce the computational cost of full-model fine-tuning, we integrate low-rank adaptation (LoRA) layers and restrict updates to these layers during both the supervised fine-tuning and PPO stages. In the supervised fine-tuning approach, task-specific prompts are concatenated with input queries and ground-truth labels from the NLU training corpus, and the model is optimized with the next-token prediction objective. Despite this, LLMs still underperform encoder-only models such as BERT-base on several NLU tasks. To close this gap, we employ PPO, a reinforcement learning technique that treats each generated token as an action and scores the generated sequence with a reward function measuring its agreement with the ground-truth answer. PPO then updates the model to maximize these rewards, aligning its outputs with the correct labels. Our experiments with the LLAMA2-7B-chat-hf model demonstrate that PPO-based fine-tuning significantly improves performance, delivering an average gain of 6.3 points over supervised fine-tuning on the GLUE benchmark. PPO surpasses zero-shot prompting by 38.7 points and few-shot prompting by 26.1 points on GLUE, and outperforms these baselines by 28.8 and 28.5 points, respectively, on SuperGLUE. Additionally, PPO exceeds the performance of BERT-large, a strong baseline, by an average of 2.7 points on GLUE and 9.3 points on SuperGLUE. These improvements hold across models such as Qwen2.5-7B-Instruct and MPT-7B-chat, highlighting PPO's robustness and effectiveness in improving the NLU capabilities of LLMs. Furthermore, LLAMA2-7B-chat-hf and LLAMA2-13B-chat-hf models fine-tuned with PPO on a single dataset exhibit strong zero-shot generalization to diverse unseen datasets. On average, they outperform GPT-4o by over 4% on sentiment analysis and natural language inference tasks, with notable gains of 7.3% on the Mental Health dataset and more than 10.9% on SIGA-nli.
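A minimal sketch of the PPO stage described in the abstract is given below, assuming the Hugging Face TRL (PPOTrainer-era API, roughly v0.7) and PEFT libraries: a LoRA-wrapped LLAMA2-7B-chat-hf policy generates an answer to an NLU prompt, receives a scalar reward reflecting agreement with the gold label, and is updated with a single PPO step. The prompt template, the +1/-1 reward shaping, and all hyperparameters here are illustrative assumptions, not the authors' reported configuration.

```python
import torch
from transformers import AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "meta-llama/Llama-2-7b-chat-hf"

# LoRA adapters are the only trainable parameters; the base weights stay frozen.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         bias="none", task_type="CAUSAL_LM")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Policy with a value head for PPO; with a PEFT model, TRL uses the adapter-disabled
# base model as the frozen reference, so no separate ref_model is needed.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, peft_config=lora_config)

ppo_config = PPOConfig(batch_size=1, mini_batch_size=1, learning_rate=1.4e-5)
ppo_trainer = PPOTrainer(ppo_config, model, ref_model=None, tokenizer=tokenizer)

# Toy SST-2-style example; real training iterates over the full NLU training split.
prompts = ["Classify the sentiment of the review as positive or negative.\n"
           "Review: a gripping, beautifully shot film.\nAnswer:"]
gold_labels = ["positive"]

generation_kwargs = {"max_new_tokens": 4, "do_sample": True, "top_k": 0,
                     "top_p": 1.0, "pad_token_id": tokenizer.eos_token_id}

query_tensors, response_tensors, rewards = [], [], []
for prompt, gold in zip(prompts, gold_labels):
    query = tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0)
    output = ppo_trainer.generate(query, **generation_kwargs)
    response = output.squeeze(0)[query.shape[0]:]   # keep only the generated tokens
    answer = tokenizer.decode(response, skip_special_tokens=True)

    # Assumed reward shaping: +1 if the generation matches the gold label, -1 otherwise.
    reward = torch.tensor(1.0 if gold.lower() in answer.lower() else -1.0)

    query_tensors.append(query)
    response_tensors.append(response)
    rewards.append(reward)

# One PPO update: maximizes reward while a KL penalty keeps the policy near the reference.
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```

Restricting gradients to the LoRA adapters keeps each PPO update cheap, while TRL's built-in KL penalty against the frozen reference model discourages the policy from drifting away from fluent generation while it learns to emit the correct labels.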
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu_Meng1
Submission Number: 3929