Keywords: model values, post-training, direct preference optimization
Abstract: Post-training is immensely important: it is what turns LLMs from next-token predictors into generally useful assistants. However, the curation of post-training data is often heuristic and empirical, and its effects are mostly understood only post-hoc. In this paper, we investigate the effects of post-training by examining when and how Olmo-3-7B-Instruct learns its *values*. We first quantify value changes across post-training, finding an increase in safety-related values during SFT but a decrease during DPO. Zooming into DPO, we find that we can predict (Spearman $\rho = 0.741$) changes in values **without training, using only the dataset**, via dot products of activation differences on DPO datapoints with value directions. Surprisingly, however, most of this value change over DPO is due to Olmo's decreased propensity to refuse; our method is likely just picking up on this simpler latent value. Nevertheless, our results show that we can, to some extent, isolate *where* values change during training and predict *how* they will change from the training data alone; we are excited about future work that further investigates such questions.
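The abstract's prediction method is described only at a high level (dot products of activation differences on DPO datapoints with value directions). Below is a minimal, hypothetical sketch of what such a computation could look like; the function and variable names (`predicted_value_shift`, `chosen_acts`, `rejected_acts`, `value_direction`) are assumptions for illustration, not the authors' implementation, and the data here is random rather than real activations or measured value changes.

```python
# Hypothetical sketch: predict a value's change from DPO data by projecting
# activation differences (chosen minus rejected response) onto a value direction.
import numpy as np
from scipy.stats import spearmanr

def predicted_value_shift(chosen_acts, rejected_acts, value_direction):
    """Mean projection of (chosen - rejected) activation differences onto a
    unit-normalized value direction.

    chosen_acts, rejected_acts: (n_datapoints, hidden_dim) arrays of model
        activations on the chosen / rejected responses of DPO datapoints.
    value_direction: (hidden_dim,) vector representing one value.
    """
    direction = value_direction / np.linalg.norm(value_direction)
    diffs = chosen_acts - rejected_acts            # (n_datapoints, hidden_dim)
    return float((diffs @ direction).mean())       # scalar predicted shift

# Toy usage: correlate predicted shifts with (stand-in) measured post-DPO changes
# across several values, reporting a Spearman rank correlation as in the paper.
rng = np.random.default_rng(0)
n, d, n_values = 128, 64, 10
predicted, measured = [], []
for _ in range(n_values):
    chosen = rng.normal(size=(n, d))               # placeholder activations
    rejected = rng.normal(size=(n, d))
    v = rng.normal(size=d)                         # placeholder value direction
    predicted.append(predicted_value_shift(chosen, rejected, v))
    measured.append(rng.normal())                  # placeholder observed change
rho, _ = spearmanr(predicted, measured)
print(f"Spearman rho = {rho:.3f}")
```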
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 107