WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement

ACL ARR 2026 January Submission10740 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, Readers: Everyone, License: CC BY 4.0

Keywords: Self-Play, Tree-Structured Exploration, Open-Web

Abstract: Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improving language models, but self-evolution methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present WIST, a Web-grounded Iterative Self-play Tree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree to structure exploration and retrieves and cleans path-consistent web evidence to construct a controllable training environment. It then performs Challenger–Solver self-play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with the Overall gains reaching +9.8 (Qwen3-4B-Base) and +9.7 (OctoThinker-8B-Hybrid-Base). WIST is also domain-steerable: switching the target domain to physics yields a +5.28 EED gain on PhyBench. Ablations further confirm the importance of tree-structured organization, posterior-guided exploration, and sliding-window updates for stable open-web learning.
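
The abstract does not specify how node posteriors are updated or how the sliding window is applied. As a rough illustration only, the sketch below assumes a binary learnability signal per node, a Beta posterior computed over the most recent signals, and Thompson sampling to choose the next node to explore. The names `Node`, `record`, `posterior`, and `select_node`, the window size of 4, and the choice of Thompson sampling are all hypothetical and are not taken from the paper.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """One topic in a hypothetical domain tree (illustrative, not WIST's own class)."""
    name: str
    # Assumed window size of 4; older signals fall out automatically.
    window: deque = field(default_factory=lambda: deque(maxlen=4))
    children: list = field(default_factory=list)

    def record(self, learnable: bool) -> None:
        # Store a binary learnability signal (an assumption about its form).
        self.window.append(1 if learnable else 0)

    def posterior(self) -> tuple:
        # Beta(1 + successes, 1 + failures) over the windowed signals only.
        successes = sum(self.window)
        failures = len(self.window) - successes
        return 1 + successes, 1 + failures

    def posterior_mean(self) -> float:
        a, b = self.posterior()
        return a / (a + b)


def select_node(nodes, rng):
    # Thompson sampling: draw one sample per node and explore the largest draw.
    return max(nodes, key=lambda n: rng.betavariate(*n.posterior()))


rng = random.Random(0)
root = Node("math")
root.children = [Node("algebra"), Node("geometry")]

# Feed back signals from (hypothetical) self-play rounds.
for signal in [True, True, True, False]:
    root.children[0].record(signal)
for signal in [False, False, False, False, True]:
    root.children[1].record(signal)  # the oldest signal is dropped by the window

chosen = select_node(root.children, rng)
```

Under these assumptions, the sliding window lets a node's posterior track how learnable it is for the current model rather than its full history, which is one plausible reading of why windowed updates would help stability.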

Paper Type: Long

Research Area: Language Models

Research Area Keywords: fine-tuning, continual learning

Contribution Types: Model analysis & interpretability

Languages Studied: English

Submission Number: 10740
