Keywords: LLM, LLM Reasoning, Consistency
Abstract: Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that trains on cohorts of similar questions instantiated from symbolic programmatic abstractions and executes the same programmatic solution, unchanged, on every member of each cohort. Our composite objective mixes execution-based signals with critique-based signals. The execution-based signals include cohort-level accuracy, retrieval usage, and penalties for invalid lookups. The critique-based signals come from a frozen judge that checks whether the program’s sub-questions cover the key factors and whether its reasoning logic moves closer to a higher-quality self-revision. Optimized via reinforcement learning, this objective steers the policy toward uniform, generalizable procedures rather than instance-specific shortcuts. Across five in-domain benchmarks (ARC-Easy/Challenge, CSQA, StrategyQA, HotpotQA) and three out-of-domain benchmarks (OpenBookQA, PubMedQA, MMLU), at two model scales (3B/7B), CC-Learn delivers roughly 10–20 absolute-point gains over strong baselines under both lenient and strict criteria, improving accuracy and stabilizing reasoning. These results show that cohort-level RL with execution signals and external feedback effectively enforces cross-variant consistency in LLMs.
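Illustrative only: a minimal sketch of how the composite cohort-level reward described in the abstract might be computed, assuming per-variant execution outcomes and the frozen judge's critique scores are already available. The names and weights here (`CohortOutcome`, `judge_coverage`, `judge_improvement`, `w_*`) are hypothetical placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CohortOutcome:
    """Execution results for one question variant in a cohort (hypothetical structure)."""
    correct: bool          # did the shared program answer this variant correctly?
    used_retrieval: bool   # did the program issue at least one retrieval call?
    invalid_lookups: int   # number of malformed or failed lookups

def cohort_reward(outcomes: List[CohortOutcome],
                  judge_coverage: float,      # frozen judge: sub-questions cover key factors (0-1)
                  judge_improvement: float,   # frozen judge: reasoning closer to a better self-revision (0-1)
                  w_acc: float = 1.0,
                  w_ret: float = 0.2,
                  w_pen: float = 0.5,
                  w_judge: float = 0.5) -> float:
    """Mix execution-based and critique-based signals into a single scalar reward."""
    n = len(outcomes)
    accuracy = sum(o.correct for o in outcomes) / n          # cohort-level accuracy
    retrieval = sum(o.used_retrieval for o in outcomes) / n  # retrieval usage rate
    penalty = sum(o.invalid_lookups for o in outcomes) / n   # invalid-lookup penalty
    critique = 0.5 * (judge_coverage + judge_improvement)    # critique-based signal
    return w_acc * accuracy + w_ret * retrieval - w_pen * penalty + w_judge * critique

# Example: a three-variant cohort where two variants are answered correctly.
reward = cohort_reward(
    [CohortOutcome(True, True, 0), CohortOutcome(True, True, 1), CohortOutcome(False, False, 0)],
    judge_coverage=0.8, judge_improvement=0.6)
print(f"composite cohort reward: {reward:.3f}")
```

Because the reward is aggregated over the whole cohort rather than per question, the policy is only rewarded for procedures that hold up across every variant, which is the mechanism the abstract credits for discouraging instance-specific shortcuts.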
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9588