Keywords: Long context, RLVR, Unsupervised
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm for enhancing the capabilities of Large Language Models (LLMs), e.g., long-context understanding.
However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for extensive human annotation or supervision from teacher models.
Specifically, we first replace a few paragraphs with special placeholders in a long document.
LLMs are then trained through reinforcement learning to reconstruct the long document by correctly identifying and sequencing missing paragraphs from a set of candidate options.
This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance.
We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2.
Our method yields a noticeable gain on RULER (nearly 10 points) and a reasonable improvement on LongBench v2, without any manually curated long-context QA data.
Furthermore, we conduct extensive ablation studies to analyze the impact of reward designs, data curation strategies, training schemes, and data scaling effects on model performance.
We will release our code, data, and models.
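The paragraph-reconstruction task described in the abstract can be illustrated with a minimal sketch. This is not the authors' released implementation: the function names, the `<MASK_k>` placeholder format, the letter labels for candidates, and both reward variants are assumptions made for illustration only. The sketch also assumes the candidate set contains exactly the removed paragraphs, with no distractors.

```python
import random


def build_example(paragraphs, num_masked=3, seed=0):
    """Mask `num_masked` paragraphs and return a reconstruction task.

    Hypothetical helper: the placeholder format and labels are assumptions.
    """
    rng = random.Random(seed)
    # Pick the positions to remove, kept in document order.
    masked_idx = sorted(rng.sample(range(len(paragraphs)), num_masked))
    gold = [paragraphs[i] for i in masked_idx]

    # Shuffle the removed paragraphs to form the candidate options.
    order = list(range(num_masked))
    rng.shuffle(order)
    labels = [chr(ord("A") + c) for c in range(num_masked)]
    candidates = {labels[c]: gold[order[c]] for c in range(num_masked)}

    # answer[k] is the label of the candidate that fills placeholder k.
    answer = [labels[order.index(k)] for k in range(num_masked)]

    # Replace each removed paragraph with a special placeholder.
    doc = list(paragraphs)
    for k, i in enumerate(masked_idx):
        doc[i] = f"<MASK_{k}>"

    return {
        "document": "\n\n".join(doc),
        "candidates": candidates,
        "answer": answer,
    }


def exact_match_reward(predicted, answer):
    """Assumed reward: 1.0 only if the whole predicted sequence is correct."""
    return 1.0 if list(predicted) == list(answer) else 0.0


def positional_reward(predicted, answer):
    """Assumed alternative reward: fraction of placeholders filled correctly."""
    if len(predicted) != len(answer):
        return 0.0
    return sum(p == a for p, a in zip(predicted, answer)) / len(answer)
```

Because the correct sequence is known by construction, either reward can be computed without human labels or a teacher model, which is what makes the task verifiable in the RLVR sense.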
Paper Type: Long
Research Area: Efficient Methods for NLP
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Language modeling
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 705