Rethinking Reward Models! A Conceptual Framework for Enhancing LLM Reasoning through Intrinsic Traits

Published: 18 Nov 2025, Last Modified: 20 Jan 2026
Venue: PLAN-FM Bridge @ AAAI 2026
License: CC BY 4.0
Keywords: Reasoning, Intrinsic Rewards, Large Language Models, Sequential Coherence, Non-Cyclic Reasoning, Principled Tool Utilization, Query Alignment
TL;DR: Shifting LLM alignment from outcome-based RL to principle-based RL via an intrinsic-trait-based reward function.
Abstract: Post-training alignment is crucial for refining the reasoning capabilities of Large Language Models (LLMs). A dominant paradigm optimizes the model's policy with reinforcement learning, powered by techniques such as Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). The success of these methods, whether they use an explicit reward model or optimize directly on preference data, depends critically on the quality of the guiding signal. However, these signals are conventionally derived from task-specific outcomes, such as correctness in math or fluency in summarization. This approach often limits the model's ability to generalize its reasoning skills across diverse domains and can lead to reward hacking or model collapse. This paper challenges the outcome-based paradigm by introducing GRIT (Generalizable Reasoning via Intrinsic Traits), a conceptual framework that shifts the emphasis from rewarding what the model answers to rewarding how it reasons. To accomplish this, we define a set of universal, task-agnostic traits of sound cognition inspired by human reasoning. These intrinsic traits are encoded as distinct reward components: (1) ensuring sequential logical coherence, (2) penalizing cyclic or redundant reasoning, (3) rewarding successful and integrated tool utilization, and (4) maintaining semantic alignment with the user's query. We hypothesize that fine-tuning an LLM to optimize for these intrinsic traits will yield a more robust and generalizable cognitive process.
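To make the reward structure concrete, the sketch below shows one hypothetical way the four trait rewards could be composed into a single scalar signal for an RL fine-tuning loop such as PPO or GRPO. Everything here is an illustrative assumption, not the paper's implementation: the ReasoningTrace container, the four scorer functions (crude lexical heuristics), and the uniform weighting are placeholders standing in for whatever learned or model-based trait scorers the framework would actually use.

```python
# Hypothetical sketch of a composite GRIT-style intrinsic reward.
# Each scorer maps a reasoning trace to [0, 1]; none is from the paper.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ReasoningTrace:
    """A candidate response decomposed into ordered reasoning steps."""
    query: str
    steps: List[str]
    tool_calls_succeeded: int = 0
    tool_calls_total: int = 0


def sequential_coherence(trace: ReasoningTrace) -> float:
    # Placeholder proxy: reward traces whose consecutive steps share
    # vocabulary, a crude stand-in for logical continuity between steps.
    if len(trace.steps) < 2:
        return 1.0
    overlaps = []
    for prev, curr in zip(trace.steps, trace.steps[1:]):
        a, b = set(prev.lower().split()), set(curr.lower().split())
        overlaps.append(len(a & b) / max(len(a | b), 1))
    return sum(overlaps) / len(overlaps)


def acyclicity(trace: ReasoningTrace) -> float:
    # Placeholder: penalize repeated steps as a proxy for cyclic or
    # redundant reasoning (unique-step ratio).
    return len(set(trace.steps)) / max(len(trace.steps), 1)


def tool_utilization(trace: ReasoningTrace) -> float:
    # Placeholder: fraction of tool calls that succeeded; neutral if the
    # trace used no tools at all.
    if trace.tool_calls_total == 0:
        return 1.0
    return trace.tool_calls_succeeded / trace.tool_calls_total


def query_alignment(trace: ReasoningTrace) -> float:
    # Placeholder: lexical overlap between the final step and the query,
    # standing in for a semantic-similarity model.
    if not trace.steps:
        return 0.0
    q = set(trace.query.lower().split())
    last = set(trace.steps[-1].lower().split())
    return len(q & last) / max(len(q), 1)


def grit_reward(trace: ReasoningTrace,
                weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the four intrinsic trait scores (assumed aggregation)."""
    traits: List[Callable[[ReasoningTrace], float]] = [
        sequential_coherence, acyclicity, tool_utilization, query_alignment,
    ]
    return sum(w * f(trace) for w, f in zip(weights, traits))


if __name__ == "__main__":
    trace = ReasoningTrace(
        query="What is the capital of France?",
        steps=["The question asks for the capital of France.",
               "France's capital is Paris."],
        tool_calls_succeeded=1,
        tool_calls_total=1,
    )
    print(grit_reward(trace))  # scalar in [0, 1], usable as an RL reward
```

The uniform weighted sum is only the simplest possible aggregation; under the framework as described, the per-trait weights could plausibly be tuned, scheduled, or learned, and each heuristic scorer replaced by a stronger model-based judge.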
Submission Number: 8