# A Fine-Grained Analysis of Pure Semantic Preference Alignment in Large Language Models

This repository contains the source code for reproducing the experiments in the paper "A Fine-Grained Analysis of Pure Semantic Preference Alignment in Large Language Models".

Large language models (LLMs) are typically aligned with human preferences through methods such as direct preference optimization (DPO). While empirically successful, these approaches face well-known limitations, including length bias, reward hacking, binary preference assumptions, and the aggregation of heterogeneous preferences into a single scalar signal. In this work, we take an inverse perspective: rather than attempting to resolve these issues, we investigate an idealized setting, which we call the *pure semantic preference scenario*, where such confounding factors are absent. We show that even in this idealized setting, existing alignment methods still do not fully capture the preference. Our analysis further reveals that (i) on-policy algorithms align more effectively, (ii) models trained without an explicit reference model perform better, and (iii) preference-model–based approaches consistently outperform reward-model–based approaches. Motivated by these observations, we introduce *preference matching optimization* (PMO), a DPO-type method that admits a closed-form solution and provably better approximates the true preference distribution. Experiments in both practical and idealized settings demonstrate that PMO achieves performance comparable to existing alignment methods in the practical setting, while offering stronger theoretical grounding and better performance in the pure semantic setting.

## Dependencies

To run the code, please install all its dependencies:
```sh
cd OpenRLHF
pip install -e .
```

## Repository content

This repository is heavily based on the OpenRLHF codebase and follows the same structure:

    @article{hu2024openrlhf,
      title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
      author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},
      journal={arXiv preprint arXiv:2405.11143},
      year={2024}
    }

### PMO
```sh
bash pmo.sh
```
PMO is implemented on top of the DPO code:

- The PMO trainer is `DPOPMTrainer` in `trainer/dpo_trainer_pm.py`.
- The probability dataset is in `datasets/prob_dataset.py`.
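
For orientation, here is a minimal sketch of the standard per-example DPO loss that the DPO code is built around. It is not the PMO objective and it is not taken from `DPOPMTrainer`. The function name, argument names, and default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative, not PMO).

    Each argument is the summed log-probability of a completion under the
    policy or the frozen reference model. `beta` is a hypothetical default.
    """
    # Scaled difference of the policy-vs-reference log-ratios.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference, the margin is zero and the loss is `log(2)`. The loss decreases as the policy favors the chosen completion more strongly than the reference does.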


## Datasets
We introduce a preference dataset designed to isolate semantic choices while removing confounds such as length, formatting, and discourse structure. Each instance is a minimal pair \citep{warstadt2020blimp} built from a single prompt and two completions that differ by exactly one lexical item in the same position (e.g., “cola” vs. “pepsi”, “tea” vs. “coffee”). Unlike the conventional binary preference data used in RLHF \citep{rafailov2023direct,li2023reinforcement}, each instance carries a soft target: the probability that one completion is preferred over the other, provided explicitly in the dataset.
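
As a rough illustration of the format described above, the sketch below models one instance and checks the minimal-pair property. The field names, the whitespace tokenization, and the example values are assumptions; the actual layout used in `datasets/prob_dataset.py` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPair:
    """Hypothetical record layout; the real field names may differ."""
    prompt: str
    completion_a: str
    completion_b: str
    prob_a_preferred: float  # soft target: P(completion_a is preferred)


def differing_positions(pair: MinimalPair) -> list[int]:
    """Return whitespace-token positions where the completions differ."""
    tokens_a = pair.completion_a.split()
    tokens_b = pair.completion_b.split()
    if len(tokens_a) != len(tokens_b):
        raise ValueError("completions must have the same number of tokens")
    return [i for i, (a, b) in enumerate(zip(tokens_a, tokens_b)) if a != b]


def is_valid_minimal_pair(pair: MinimalPair) -> bool:
    """Check for exactly one differing token and a soft target in [0, 1]."""
    try:
        diffs = differing_positions(pair)
    except ValueError:
        return False
    return len(diffs) == 1 and 0.0 <= pair.prob_a_preferred <= 1.0


example = MinimalPair(
    prompt="What would you like to drink?",
    completion_a="I would like some tea please.",
    completion_b="I would like some coffee please.",
    prob_a_preferred=0.7,  # made-up value for illustration only
)
```

Here `is_valid_minimal_pair(example)` holds because the two completions differ only at one token position, so length and formatting cannot explain the preference.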

The dataset is provided in a separate folder.

