Keywords: large language models, reinforcement learning, information extraction
Abstract: While large language models (LLMs) can provide generally reasonable answers to complex information extraction (IE) tasks through prompt engineering and supervised fine-tuning (SFT), their performance and safety remain limited. We propose a novel fuzzy matching method to reveal that this is largely due to the definition bias between the model and the dataset. To mitigate this problem without human intervention, we use Reinforcement Learning with Verifiable Rewards (RLVR) to train the model, enabling it to independently learn the inherent definition of the task from the dataset. Specifically, we use Group Relative Policy Optimization (GRPO) to train LLMs of varying parameter sizes, rewarded with micro F1 scores, and achieve notably higher precision and recall than SFT across all models. We then apply fuzzy matching again to statistically demonstrate that this improvement is primarily due to the mitigation of the definition bias between the model and the dataset.
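The abstract describes rewarding GRPO rollouts with micro F1 scores against the dataset annotations. Below is a minimal sketch of what such a verifiable reward could look like for a single completion, assuming the IE output is a set of (entity, relation, entity) triples scored by exact match; the triple format, the per-completion scoring, and the paper's fuzzy-matching criterion are assumptions, not the authors' released implementation.

```python
# Hypothetical verifiable reward: F1 of one completion's extracted triples
# against the gold annotation. The triple schema and exact-match scoring are
# assumptions for illustration only.
from typing import List, Tuple

Triple = Tuple[str, str, str]  # assumed (head entity, relation, tail entity) format

def f1_reward(predicted: List[Triple], gold: List[Triple]) -> float:
    """Return the F1 score of a single completion against the gold triples."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)          # true positives: exactly matching triples
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Example: one correct triple, one spurious triple, one gold triple missed.
pred = [("Marie Curie", "born_in", "Warsaw"), ("Marie Curie", "born_in", "Paris")]
gold = [("Marie Curie", "born_in", "Warsaw"), ("Marie Curie", "award", "Nobel Prize")]
print(f1_reward(pred, gold))  # 0.5 (precision 0.5, recall 0.5)
```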
Primary Area: reinforcement learning
Submission Number: 17523