
\section{Self-Taught Principle Learning}\label{sec:algorithm}

We propose STaPLe, a self-improvement mechanism for 1) discovery of principles by the model itself, aimed at response revision, and 2) training the model to invoke such principles and subsequently perform response self-refinement (if needed) at inference time. We view these principles as latent reasoning traces that bridge the gap between an initial model response and a reference target.

In the vein of the Self-Taught Reasoner (STaR) \citep{zelikman}, we leverage the gold response as a ``hint'' to propose principles and guide response refinement decisions. However, our formulation is generic and allows for the use of non-verifiable gold responses as hints. In particular, we use the proximity of the generated response to the reference response as a signal of correctness. Our approach is agnostic to the choice of similarity metric used to measure this proximity; the exact match metric used for verifiable responses can be seen as one such instantiation.

Given a dataset $\mathcal{D} = \{(x_i,y_i^1,y_i^G)\}_{i=1}^n$, where $y_i^G$ is the gold response and $y_i^1$ is the model's initial response for the $i^{\text{th}}$ sample, we aim to learn a latent response-improvement reasoning trace $z_i$ such that the probability of producing a response close to the gold reference is maximized. The latent reasoning trace, or \textit{principle}, $z_i$ is also verbalized as natural language, i.e., as a sequence of discrete text tokens $z_i \in \mathcal{V}^*$.
%
We implement STaPLe to optimize the following marginal likelihood:
%
\[p(y^G \mid x, y^1) = \sum_{y^2 \in \mathcal{V}^*} \sum_{z \in \mathcal{V}^*} p(y^G \mid x,y^1,z, y^2) \cdot p(y^2, z \mid x,y^1; \theta)\]
%
where $y^2$ is a model refinement of the initial response $y^1$, generated with the aid of the latent principle $z$. The distribution $p(y^G \mid x, y^1, z, y^2)$ is a fixed, prespecified validator model indicating the likelihood of the current revision $y^2$ matching the gold response $y^G$. We parametrize $p(y^2,z \mid x , y^1)$ by the language model itself, with parameters $\theta$. As shown in Appendix \ref{appendix:mc-em-gradient}, and following the standard latent variable model formulation, the gradient for this objective is:
%
\begin{align}\label{eq-1}
\nabla_\theta \hspace{0.5mm} \mathcal{L}(\theta) &= \mathbb{E}_{p(y^2,z \hspace{0.5mm}\mid \hspace{0.5mm} x,y^1,y^G)} \left\{\nabla_\theta \log p(y^2,z \mid x,y^1; \theta)\right\}
\end{align}
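For intuition, the identity can be sketched in two lines, assuming $\mathcal{L}(\theta)$ denotes the log marginal likelihood $\log p(y^G \mid x, y^1)$ and recalling that the validator does not depend on $\theta$. Using $\nabla_\theta p = p \, \nabla_\theta \log p$ followed by Bayes' rule:
\begin{align*}
\nabla_\theta \log p(y^G \mid x,y^1) &= \sum_{y^2, z} \frac{p(y^G \mid x,y^1,z,y^2) \, p(y^2,z \mid x,y^1;\theta)}{p(y^G \mid x,y^1)} \, \nabla_\theta \log p(y^2,z \mid x,y^1;\theta) \\
&= \sum_{y^2, z} p(y^2,z \mid x,y^1,y^G) \, \nabla_\theta \log p(y^2,z \mid x,y^1;\theta)
\end{align*}
The last line is the posterior expectation in Equation \ref{eq-1}.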
%
This objective can be maximized via Expectation-Maximization (EM). This comprises the repeated application of two alternating stages: 1) a principle discovery stage (E-step) and 2) a principle learning stage (M-step). This is depicted in Figure \ref{fig:main-figure}.

In the \textbf{principle discovery stage (E-step)}, we sample $N$ principles $z_{(1:N)}$ and corresponding responses $y^2_{(1:N)}$ from the posterior $p(y^2,z \hspace{0.5mm}\mid \hspace{0.5mm} x,y^1,y^G)$. Computing the true posterior exactly would require an intractable marginalization over $\mathcal{V}^*$. We therefore approximate this posterior by ``hinting'' our language model with the gold response\footnote[1]{Here we factorize the approximate posterior into principle generation and response generation terms, where the gold response is seen only by the principle generation term. This helps avoid the trivially degenerate solution where $y^2$ simply copies $y^G$. We acknowledge that, in the absence of any constraint on $z$, copying can still occur via $z$. However, in practice, the prompt that elicits the principle creates a contextual bias against exact copying. Moreover, we also experiment with explicit clustering constraints over $z$ and show that both versions of our approach perform similarly.}, together with a prompt to elicit a principle. This is represented by:
%
\[ \tilde{p}(y^2, z \hspace{0.5mm}\mid \hspace{0.5mm} x,y^1,y^G; \theta) = p(y^2 \mid x, y^1, z; \theta) \cdot p(z \mid x, y^1, y^G; \theta)\] 
%
To improve the quality of our samples, we employ a cycle-consistency approach, in which the true posterior is implicitly defined, up to normalization, by:
%
\begin{equation}
p(y^2,z \mid x,y^1,y^G; \theta) \propto p(y^G \mid x,y^1,z, y^2) \cdot \tilde{p}(y^2,z \hspace{0.5mm}\mid \hspace{0.5mm} x,y^1,y^G; \theta)
\end{equation}
%

\begin{figure}
    \centering
    \includegraphics[width=0.7\linewidth]{Figures/principle-discovery-tikz-figure.png}
    \caption{The figure above depicts the \textbf{\textit{principle discovery}} (E-step) phase. We sample an initial response $y^1$ on-policy, then ``hint'' with the gold response to elicit candidate principles $z_{(1:N)}$. Then, we sample critiques on the initial response (only used in rejection sampling, and not included in the fine-tuning trajectories), which we use to obtain principle-guided refined responses $y^2_{(1:N)}$. The best refined response $\hat{y}^2$ is selected based on similarity to the gold response. We save the resulting trajectory, which is used for supervised fine-tuning in the \textbf{\textit{principle learning}} (M-step) stage.}
    %\footnotetext{These critiques are only used during rejection sampling to induce a principle-guided refinement, and are not included in the trajectories for fine-tuning.}
    \label{fig:principle-discovery-tikz-figure}
\end{figure}

This can be seen as equivalent to hinted CoT generation as in STaR, whereby samples with lower reconstruction error (i.e., higher similarity to the gold response) are assigned higher probability. In practice, we use a sparse approximation of this distribution, which assigns zero probability unless the refinement improves on the initial response under the similarity function $f$:

\[p(y^G \mid x, y^1,z,y^2) \propto \begin{cases}
    f(y^2,y^G) \hspace{0.5mm}-\hspace{0.5mm} f(y^1,y^G),& \text{if } f(y^2,y^G) \hspace{0.5mm}>\hspace{0.5mm} f(y^1,y^G)\\
    0,              & \text{otherwise}
\end{cases}\]

%
We sample from the posterior $p(y^2,z \mid x, y^1, y^G; \theta)$ via rejection sampling, using $\tilde{p}$ as the proposal distribution. Given a sample $(y_n, z_n) \sim \tilde{p}(y^2,z \mid x,y^1,y^G; \theta)$, we accept it with probability $p_n = \frac{p(y_n,z_n \hspace{0.5mm}\mid\hspace{0.5mm} x,y^1,y^G; \theta)}{M \cdot \tilde{p}(y_n, z_n \hspace{0.5mm}\mid\hspace{0.5mm} x,y^1, y^G;\theta)}$, where $M$ is a constant chosen such that $p_n \leq 1$ for all samples.
We include a derivation of this rejection sampling rule in Appendix \ref{appendix:rejection-sampling-rule}.
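
Because the cycle-consistent posterior is proportional to the validator times the proposal, the ratio in the acceptance probability reduces to the validator weight divided by a constant. The following minimal Python sketch illustrates this under stated assumptions: the similarity $f$ takes values in $[0,1]$, the bound is set to $M = 1 - f(y^1, y^G)$ (the largest improvement any refinement could achieve), and similarity scores are precomputed. The function names (\texttt{validator\_weight}, \texttt{acceptance\_probability}, \texttt{rejection\_sample}) are hypothetical and chosen only for illustration.

\begin{verbatim}
```python
import random


def validator_weight(f_refined: float, f_initial: float) -> float:
    """Sparse validator: improvement in similarity to the gold response, or 0."""
    return max(0.0, f_refined - f_initial)


def acceptance_probability(f_refined: float, f_initial: float) -> float:
    """Acceptance probability w / M with M = 1 - f_initial (assumes f in [0, 1])."""
    bound = 1.0 - f_initial  # largest improvement any refinement could achieve
    if bound <= 0.0:
        return 0.0  # the initial response already matches the gold response
    return validator_weight(f_refined, f_initial) / bound


def rejection_sample(candidates, f_initial, rng=None):
    """Keep each (principle, refinement, f_refined) proposal with probability w / M."""
    rng = rng or random.Random(0)
    accepted = []
    for principle, refinement, f_refined in candidates:
        if rng.random() < acceptance_probability(f_refined, f_initial):
            accepted.append((principle, refinement))
    return accepted
```
\end{verbatim}

Under these assumptions, a refinement that does not improve on the initial response is never accepted, and one that matches the gold response exactly is always accepted.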

We also compare the initial response $y^1$ to the gold reference $y^G$ using the similarity metric; if they are sufficiently close, we accept the response without further refinement and without sampling a principle $\hat{z}$. The principle discovery stage yields a principle-augmented dataset $\mathcal{D}'$ of trajectories $(x, y^1, \hat{z}, \hat{y}^2)$. Note that if no refinement improves upon the initial generation relative to the gold response, we discard the sample; thus, $\mathcal{D}'$ consists only of those samples on which a principle improved the quality of the response towards the gold.
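
The per-sample selection logic described above can be sketched as follows. This is an illustrative sketch, not the exact implementation: the word-overlap similarity \texttt{jaccard} is a toy stand-in for $f$, the threshold \texttt{close\_enough} and the function name \texttt{discover} are hypothetical, and candidate (principle, refinement) pairs are assumed to have been generated by the model beforehand. How samples accepted without refinement are stored is not specified in the text; here they are returned with no principle.

\begin{verbatim}
```python
def jaccard(a: str, b: str) -> float:
    """Toy similarity: overlap of lowercase word sets (a stand-in for f)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def discover(x, y1, y_gold, candidates, f=jaccard, close_enough=0.9):
    """Return one trajectory (x, y1, z_hat, y2_hat), or None to discard the sample.

    candidates: list of (principle, refined_response) pairs proposed by the model.
    """
    f_initial = f(y1, y_gold)
    if f_initial >= close_enough:
        # Initial response is already close to gold: no principle, no refinement.
        return (x, y1, None, None)
    best = max(candidates, key=lambda c: f(c[1], y_gold), default=None)
    if best is None or f(best[1], y_gold) <= f_initial:
        return None  # no refinement improved on the initial response
    return (x, y1, best[0], best[1])
```
\end{verbatim}

Selecting the single highest-scoring refinement mirrors the best-of-$N$ choice of $\hat{y}^2$ in the figure; any other similarity function with the same signature could be passed as \texttt{f}.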

In the \textbf{principle learning stage (M-step)}, we use the data $\mathcal{D}'$ collected in the principle discovery stage for supervised fine-tuning of the language model. In particular, we train the model to maximize the log-likelihood of the refinement trajectories in $\mathcal{D}'$. The corresponding EM update can be written as:
\[\theta^{(t+1)} = \arg\max\limits_{\theta} \mathbb{E}_{(x,y^1,\hat{z},\hat{y}^2) \in \mathcal{D}'}\left[\log p(\hat{y}^2,\hat{z} \mid x,y^1; \theta)\right]\]

This should qualitatively result in the fine-tuned LM being able to invoke principles conditioned on a prompt and learning to produce high-quality responses conditioned on both the prompt and the invoked principle. 
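
One way to see this is through the chain rule, which splits each trajectory's log-likelihood into a principle-invocation term and a principle-conditioned refinement term:
\[\log p(\hat{y}^2,\hat{z} \mid x,y^1; \theta) = \log p(\hat{z} \mid x,y^1; \theta) + \log p(\hat{y}^2 \mid x,y^1,\hat{z}; \theta)\]
Unlike the E-step proposal $p(z \mid x, y^1, y^G; \theta)$, the first term does not condition on the gold response, so maximizing it trains the model to propose principles without access to the hint.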

The two stages can be repeated multiple times, achieving incremental improvements until no further gains are seen with respect to the gold references. We also draw a connection between STaPLe (this EM procedure) and variance-reduced self-play; this is discussed further in Appendix \ref{thrm-1-proof}.

\subsection{Posterior Regularization via Clustering}\label{sec:pr-clustering}

To maximize the human interpretability of principles and their application relative to specific domains, it is beneficial to have a compressed set, or \textit{constitution}, to distill into the model.
However, the E-step described above results in thousands of unique principles. We seek to project this set into a constrained subspace where the resulting principles serve as representatives for desirable attributes to be reflected. This can be achieved via posterior regularization (PR) in latent variable modeling.
For a posterior constraint set $\mathcal{Q}$, the canonical posterior regularization framework solves the problem
\[q^*(z) = \arg\min\limits_{q \in \mathcal{Q}} \mathrm{KL}\left(q(z \mid x,y^*) \,\|\, p(z \mid x,y^*)\right)\]

From \cite{pr-latent-var-models}, we obtain that the primal solution is given by:
\[\tilde{p}(y^2,z \mid x,y^1,y^G) \propto p(y^2,z \mid x,y^1,y^G) \cdot \exp{(-\lambda g(z))}\] where $g(z)$ denotes the constrained features of the principles and $\lambda$ is a Lagrange multiplier that must be set such that the expected value of the features under $\tilde{p}$ respects the constraints. 

Consider the following definition of the constraints: assume access to a clustering algorithm which yields a set of clusters $\{C_1, C_2, \dots, C_K\}$. For each cluster $C_k$, a representative element $\tilde{z}_k$ is chosen, forming the set $\widetilde{Z} = \{\tilde{z}_1, \dots, \tilde{z}_K\}$. Now define the binary feature functions $g_k(z) = \mathbf{1}(z \in C_k \setminus \{\tilde{z}_k\})$ for $k \in [1,K]$. Thus, to ensure that the regularized posterior only places mass on the representative elements, we can enforce the constraint set $\mathcal{Q} = \{q: \mathop{\mathbb{E}}_q[g_k(z)] = 0 \hspace{1mm} \forall \hspace{1mm} k \in [1,K] \}$.
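
As a worked special case (a sketch, assuming the constraints are satisfiable, i.e., $p$ places nonzero mass on at least one representative): the constraints $\mathbb{E}_q[g_k(z)] = 0$ force $q(z) = 0$ for every non-representative $z$, and the KL projection onto this support constraint is the renormalized restriction of $p$ to the representatives:
\[q^*(z) = \frac{p(z \mid x,y^*) \cdot \mathbf{1}(z \in \widetilde{Z})}{\sum_{z' \in \widetilde{Z}} p(z' \mid x,y^*)}\]
This corresponds to the limit $\lambda \to \infty$ in the primal solution, where the factor $\exp(-\lambda g_k(z))$ vanishes for every non-representative element and equals one for every representative.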

However, while an algorithm like projected gradient descent could be used to solve for the Lagrange multipliers, this is expensive for deep neural networks and is therefore impractical in our case. Instead, we suggest that clustering a set of posterior samples and retaining only the representative elements fulfills an equivalent role empirically. Clustering serves to consolidate principles that are lexically close, and leveraging an embedding model for distances allows for semantic awareness in merging similar elements.

In particular, we consider hierarchical (agglomerative) clustering for several of its benefits: (1) it requires no a priori assumptions about the number of clusters or cluster shape, (2) the algorithm is deterministic, ensuring that the same clusters are obtained for a given configuration, and (3) the algorithm is relatively fast, taking only a few seconds in practice over thousands of principles. To ensure that the clustering is performed in a semantics-aware manner, we first obtain a sentence embedding with an encoder-only model and perform clustering over these embeddings.
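
To make the procedure concrete, the following is a naive pure-Python sketch of agglomerative clustering over precomputed embeddings. The linkage criterion and stopping rule are not specified above, so average linkage with a distance threshold is assumed here purely for illustration; the function name \texttt{agglomerative} and the \texttt{threshold} parameter are hypothetical. This version favors readability over speed, and an optimized library implementation would be preferable for thousands of principles.

\begin{verbatim}
```python
import math


def agglomerative(embeddings, threshold):
    """Naive average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters until the closest pair is
    farther apart than `threshold`. Returns clusters as lists of indices.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(math.dist(embeddings[i], embeddings[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        dist, p, q = min(pairs)  # ties broken by cluster position, so deterministic
        if dist > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters
```
\end{verbatim}

Note that the number of clusters is not fixed in advance; it emerges from the threshold, and repeated runs on the same input produce the same clusters.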

Given the principle-augmented dataset $(x_i, y^1_i, \hat{z}_i, y^2_i) \in \mathcal{D}'$ and a set $\widetilde{Z}$ of cluster representative elements over $\mathcal{C} = \{C_k\}_{k=1}^K$, we aim to replace $\hat{z}_i$ with the element $\tilde{z} \in \widetilde{Z}$ that is closest in meaning to the original principle. Qualitatively, we want the set $\widetilde{Z}$ to comprise the human-readable constitution, minimizing semantic overlap in its labels. We take the medoid as the cluster representative:

\begin{equation}
    \widetilde{Z}_{medoid} = \left\{m_k: m_k = \arg\min\limits_{m \in C_k} \sum_{j \in C_k} \|e_m - e_j\|_2, \hspace{0.5mm} k \in [1,K]\right\}
    \tag{Medoid Representatives}
\end{equation}

Here $e_j$ denotes the sentence embedding of principle $j$. For each sample $i$, it suffices to retrieve the cluster containing $\hat{z}_i$ and replace $\hat{z}_i$ with that cluster's medoid $\tilde{z}_i \in \widetilde{Z}_{medoid}$. The resulting dataset $\widetilde{\mathcal{D}}$ of trajectories $(x_i, y^1_i, \tilde{z}_i, y^2_i)$ is then used to train the model.
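
The medoid selection and replacement step can be sketched in a few lines of Python. This is a minimal illustration under assumptions: principles are stored in a list aligned with the dataset, embeddings are precomputed, clusters are given as lists of indices, and the names \texttt{medoid} and \texttt{relabel} are hypothetical.

\begin{verbatim}
```python
import math


def medoid(cluster, embeddings):
    """Index in `cluster` minimizing total Euclidean distance to all members."""
    return min(
        cluster,
        key=lambda m: sum(math.dist(embeddings[m], embeddings[j]) for j in cluster),
    )


def relabel(principles, embeddings, clusters):
    """Replace each principle with the medoid principle of its cluster."""
    relabeled = list(principles)
    for cluster in clusters:
        rep = principles[medoid(cluster, embeddings)]
        for i in cluster:
            relabeled[i] = rep
    return relabeled
```
\end{verbatim}

Because the medoid is always an actual member of its cluster, every label in the relabeled dataset is a principle that the model itself generated.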