WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement

ACL ARR 2026 January Submission10740 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Self-Play, Tree-Structured Exploration, Open-Web
Abstract: Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improving language models, but self-evolution methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present WIST, a Web-grounded Iterative Self-play Tree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree to structure exploration, then retrieves and cleans path-consistent web evidence to construct a controllable training environment. It then performs Challenger–Solver self-play with verifiable rewards, and feeds learnability signals back to update node posteriors and guide subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with Overall gains reaching +9.8 (Qwen3-4B-Base) and +9.7 (OctoThinker-8B-Hybrid-Base). WIST is also domain-steerable: switching the target domain to physics yields a +5.28 EED gain on PhyBench. Ablations further confirm the importance of tree-structured organization, posterior-guided exploration, and sliding-window updates for stable open-web learning.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: fine-tuning, continual learning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10740