\section{Related Work}

\paragraph{Principle-Driven Language Modeling}

Early work in principle-driven alignment demonstrated that embedding high-level rule sets or ``constitutions'' into the training loop can steer model behavior without direct human labels for each generation. Constitutional AI \citep{bai2022constitutionalaiharmlessnessai} introduced a two-stage process in which a pre-trained model first generates critiques of its own outputs against a static constitution, curated and synthesized a priori, and learns from these critiques; a preference model is then trained on on-policy data and used for RL. Dromedary \citep{dromedary} extended this idea with Self-Align, an algorithm that generates prompts, applies a small set of human-written principles via in-context learning, and then fine-tunes the model on the principle-guided responses; SALMON \citep{sun2024salmon} later built on this to design an instructable, principle-following reward model (RM). ConstitutionalExperts \citep{petridis-etal-2024-constitutionalexperts} and SPRI \citep{zhan2025sprialigninglargelanguage} address the problem of mapping prompts to principles. Deliberative Alignment \citep{guan2025deliberativealignmentreasoningenables} introduces chain-of-thought (CoT) reasoning over safety specifications and trains models to learn these reasoning traces via SFT and online RL with a safety RM.

More recent efforts have sought to leverage models to draft and refine constitutions with limited human supervision. LEAP \citep{zhang2024incontextprinciplelearningmistakes} showed that models can propose new principles via self-critique given the gold response, synthesizing a list of task-specific principles that can be used for in-context learning. SAMI \citep{franken} introduced a self-alignment mechanism in which a strong model generates a small set of principle candidates, and the target model is trained to maximize the mutual information between the constitution and the model's responses through an InfoNCE loss. IterAlign \citep{chen-etal-2024-iteralign} and ICAI \citep{findeis2025inverse} also leverage strong frontier models such as GPT-4/GPT-4o \citep{openai2024gpt4ocard} for constitution proposal. The former uses a red-teaming framework to identify model weaknesses and has the strong model propose principles targeting helpfulness, harmlessness, and honesty; the latter treats principles as specifications over preference annotations, injecting them into an annotator's prompt to reconstruct those annotations. Most recently, DeepSeek introduced a pointwise generative reward model \citep{liu2025inferencetimescalinggeneralistreward} over self-generated principles, demonstrating that such an RM can be successfully used to improve inference-time scaling.
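
For reference, a generic InfoNCE objective of the kind used for mutual-information maximization between constitutions and responses can be sketched as follows. The notation is illustrative and is not taken from SAMI itself: $c_i$ denotes a constitution, $y_i$ a response generated under it, $s(\cdot,\cdot)$ a hypothetical scoring function (for instance, the log-likelihood of a response given a constitution), and $N$ the number of contrasted pairs; the exact scoring function and normalization in any given method may differ.
\begin{equation}
\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(s(c_i, y_i)\big)}{\sum_{j=1}^{N} \exp\big(s(c_j, y_i)\big)},
\qquad
I(c; y) \;\geq\; \log N - \mathcal{L}_{\mathrm{InfoNCE}} .
\end{equation}
Under these assumptions, minimizing the loss raises a lower bound on the mutual information $I(c; y)$, which is why a contrastive objective can stand in for the intractable quantity.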

\paragraph{Self-Correction and Self-Improvement}
The recent emergence of large reasoning models such as o3 and DeepSeek-R1 \citep{openai2025o3_o4mini_systemcard,deepseekai2025deepseekr1incentivizingreasoningcapability} has led to growing interest in the ability of models to intrinsically refine their own outputs using internal feedback. However, much of the prior literature below either trains models over the improved responses alone or performs prompted self-refinement at inference time, rather than \textit{learning} this ability. STaR first leveraged model-generated critiques followed by revision, showing that alternating these two stages yields gains in instruction following \citep{zelikman}. Self-Refine \citep{madaan} explored the setting of prompted multi-turn refinement: after producing an initial answer, the model is induced to critique itself, reflecting on the weaknesses of the current response along specific dimensions, proposing actionable changes, and incorporating them into the corrected response. ProMiSe \citep{ramji2024selfrefinementlanguagemodelsexternal} extended this ability to smaller language models with weaker self-critique and refinement capabilities, showing that greedily refining responses with respect to one attribute at a time, in sequence, improves performance, and that training on synthetic dialogues modeling this self-refinement process improves dialogue question answering. APO \citep{doosterlinck2024anchoredpreferenceoptimizationcontrastive} addressed the notion of minimally contrastive preference pairs, supporting the view that revision along fewer attributes yields a better signal for preference optimization.
More recently, reinforcement learning strategies have been explored to train models to perform self-correction directly. RISE \citep{RISE} frames self-correction as a multi-turn Markov decision process (MDP), performs best-of-N sampling over sequential attempt rollouts, and uses offline RL to train the model to correct over these trajectories. SCoRe \citep{kumar2025training} improves on this with a multi-turn RL formulation that boosts the quality of the initial attempt and leverages reward shaping to incentivize self-correction that improves the refined response. Beyond individual output corrections, recent work has shown that models can bootstrap their underlying capabilities iteratively over time. Several works suggest that sampling diverse responses or environmental interactions, filtering based on feedback or majority voting, and training on these on-policy generations can boost performance \citep{huang-etal-2023-large,patel2024largelanguagemodelsselfimprove}. SPIN \citep{SPIN} introduced a self-play fine-tuning approach in which the model's generations are paired with the ground-truth annotated responses in the SFT dataset to form preference pairs, the model is fine-tuned with a contrastive objective, and the process is repeated iteratively. \citet{huang2025selfimprovement} theoretically formalize the self-improvement phenomenon as a ``\textit{sharpening}'' process, wherein the model's policy moves towards maximizing its generations' self-reward.

\paragraph{Latent Chain-of-Thought Learning}

Chain-of-thought (CoT) prompting \citep{wei2023chainofthoughtpromptingelicitsreasoning, kojima} elicits an explicit, verbalized step-by-step walkthrough of the reasoning trace guiding the model from the input to the final response. Concurrently, STaR \citep{zelikman} leveraged a gold response-conditioned rationalization of the CoT and fine-tuned the model to learn this reasoning behavior. The notion of rationalization as a latent variable modeling problem was previously explored in ELV \citep{ELV}, under the framework of labeling examples with explanations through a variational expectation-maximization (EM) algorithm. More recent approaches also treat chain-of-thought as a latent variable that can be trained over, rather than only induced at inference time. TRICE \citep{TRICE} casts intermediate reasoning steps as latent codes, training the model to marginalize over them through a Markov chain Monte Carlo (MCMC) EM algorithm so that it internally develops coherent reasoning trajectories. LaTRO \citep{chen2024languagemodelshiddenreasoners} demonstrated, through a variational framework, that models can self-reward latent reasoning paths by generating candidate thought sequences, scoring them by task success, and reinforcing the most effective ones. Concurrent work introduced BoLT \citep{ruan2025reasoninglearnlatentthoughts}, showing that leveraging these implicit traces as supervision, which converts latent chain-of-thought into a self-supervised learning signal, leads to gains in data efficiency and performance for continued pre-training on complex reasoning benchmarks.
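
The latent-variable view shared by these methods can be summarized with the standard marginal likelihood and its variational lower bound. This is a generic sketch in illustrative notation rather than the objective of any single cited method: $x$ is the input, $y$ the final response, $z$ the latent reasoning trace, $p_\theta$ the model, and $q$ an assumed proposal distribution over traces.
\begin{equation}
\log p_\theta(y \mid x) = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
\;\geq\;
\mathbb{E}_{q(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big] - \mathrm{KL}\big(q(z \mid x, y) \,\Vert\, p_\theta(z \mid x)\big).
\end{equation}
Roughly speaking, EM-style approaches alternate between improving the proposal over $z$ and updating $\theta$, whereas sampling-based approaches approximate the sum over $z$ directly; the methods above differ mainly in how the traces are proposed and scored.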
