O²-CritiCuRL: Offline-Online Step-aware Curriculum Reinforcement Learning for Visual Reasoning

Wendi Deng; Hang Du; Guoshun Nan; HaoKun Tian; Jiaqi Yu; Jiale Li; Xinlei Cao; Ji Zhang; Jun Liu; Xudong Jiang; Sicong Leng

18 Sept 2025 (modified: 27 Sept 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Large Language Model, Reasoning

Abstract: Large language models demonstrate strong capabilities in reasoning tasks, yet they frequently generate flawed intermediate reasoning steps while still arriving at correct final answers. Such behavior raises concerns about interpretability and reliability, as it suggests reliance on spurious shortcuts rather than faithful reasoning. Existing attempts to incorporate step-level supervision are limited by long, redundant trajectories that burden optimization and obscure decisive reasoning steps. We propose O²-CritiCuRL, a novel curriculum reinforcement learning framework that explicitly models critical-step awareness through an iterative offline–online training paradigm. In the offline stage, O²-CritiCuRL decomposes chain-of-thought trajectories and employs a step-level reward to automatically identify decisive steps while down-weighting redundant ones, then restructures the trajectories into difficulty tiers for curriculum learning. In the online stage, we introduce a progressive step-level reinforcement learning strategy in which truncated reasoning chains encourage the model to infer missing steps and refine its reasoning process. The two stages are coupled through an iterative offline–online mechanism, enabling the model to progressively improve its focus on critical steps and overcome the limitations of static supervision. Extensive experiments across multiple reasoning benchmarks demonstrate that our method achieves state-of-the-art performance.
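
The offline stage and the chain truncation described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the abstract does not specify the reward model, the criterion for a decisive step, or the tiering rule, so the per-step scores, the `threshold`, the "more critical steps means harder" rule, and every function name below are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    steps: List[str]           # chain-of-thought steps, in order
    step_rewards: List[float]  # hypothetical per-step reward scores in [0, 1]


def critical_step_indices(traj: Trajectory, threshold: float = 0.5) -> List[int]:
    """Indices of steps whose (hypothetical) reward meets the threshold."""
    return [i for i, r in enumerate(traj.step_rewards) if r >= threshold]


def difficulty_tier(traj: Trajectory, threshold: float = 0.5, num_tiers: int = 3) -> int:
    """Assumed rule: more critical steps means a harder trajectory."""
    n_critical = len(critical_step_indices(traj, threshold))
    return min(n_critical, num_tiers - 1)


def build_curriculum(
    trajs: List[Trajectory], threshold: float = 0.5, num_tiers: int = 3
) -> List[List[Trajectory]]:
    """Group trajectories into tiers, easiest (tier 0) first."""
    tiers: List[List[Trajectory]] = [[] for _ in range(num_tiers)]
    for t in trajs:
        tiers[difficulty_tier(t, threshold, num_tiers)].append(t)
    return tiers


def truncate_chain(traj: Trajectory, keep_fraction: float) -> List[str]:
    """Keep only a prefix of the chain; the model would have to infer the rest."""
    keep = int(len(traj.steps) * keep_fraction)
    return traj.steps[:keep]
```

Under these assumptions, a progressive schedule would lower `keep_fraction` over training so that the model has to reconstruct more of the chain itself, and the iterative coupling would rescore and re-tier trajectories after each online round.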

Primary Area: reinforcement learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2026/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 11139
