Keywords: LLM, LLM Reasoning, Consistency
Abstract: Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that trains on cohorts of similar questions instantiated from symbolic programmatic abstractions and executes the same programmatic solution, unchanged, on every member of each cohort. Our composite objective mixes execution-based signals with critique-based signals. The execution-based signals include cohort-level accuracy, retrieval usage, and penalties for invalid lookups. The critique-based signals come from a frozen judge that checks whether the program’s sub-questions cover the key factors and whether its reasoning logic moves closer to a higher-quality self-revision. Optimized via reinforcement learning, this objective steers the policy toward uniform, generalizable procedures rather than instance-specific shortcuts. Across five in-domain benchmarks (ARC-Easy/Challenge, CSQA, StrategyQA, HotpotQA) and three out-of-domain benchmarks (OpenBookQA, PubMedQA, MMLU), at two model scales (3B/7B), CC-Learn delivers roughly 10–20 absolute-point gains over strong baselines under both lenient and strict criteria, improving accuracy and stabilizing reasoning. These results show that cohort-level RL with execution signals and external feedback effectively enforces cross-variant consistency in LLMs.
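Illustrative only: a minimal sketch of how the composite cohort-level reward described in the abstract might be computed, assuming per-variant execution outcomes and the frozen judge's critique scores are already available. The names and weights here (`CohortOutcome`, `judge_coverage`, `judge_improvement`, `w_*`) are hypothetical placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CohortOutcome:
    """Execution results for one question variant in a cohort (hypothetical structure)."""
    correct: bool          # did the shared program answer this variant correctly?
    used_retrieval: bool   # did the program issue at least one retrieval call?
    invalid_lookups: int   # number of malformed or failed lookups

def cohort_reward(outcomes: List[CohortOutcome],
                  judge_coverage: float,      # frozen judge: sub-questions cover key factors (0-1)
                  judge_improvement: float,   # frozen judge: reasoning closer to a better self-revision (0-1)
                  w_acc: float = 1.0,
                  w_ret: float = 0.2,
                  w_pen: float = 0.5,
                  w_judge: float = 0.5) -> float:
    """Mix execution-based and critique-based signals into a single scalar reward."""
    n = len(outcomes)
    accuracy = sum(o.correct for o in outcomes) / n          # cohort-level accuracy
    retrieval = sum(o.used_retrieval for o in outcomes) / n  # retrieval usage rate
    penalty = sum(o.invalid_lookups for o in outcomes) / n   # invalid-lookup penalty
    critique = 0.5 * (judge_coverage + judge_improvement)    # critique-based signal
    return w_acc * accuracy + w_ret * retrieval - w_pen * penalty + w_judge * critique

# Example: a three-variant cohort where two variants are answered correctly.
reward = cohort_reward(
    [CohortOutcome(True, True, 0), CohortOutcome(True, True, 1), CohortOutcome(False, False, 0)],
    judge_coverage=0.8, judge_improvement=0.6)
print(f"composite cohort reward: {reward:.3f}")
```

Because the reward is aggregated over the whole cohort rather than per question, the policy is only rewarded for procedures that hold up across every variant, which is the mechanism the abstract credits for discouraging instance-specific shortcuts.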
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9588