Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

ACL ARR 2026 January Submission8441 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large language models; Preference alignment; Topological data analysis; Persistent homology
Abstract: Alignment of large language models (LLMs) typically relies on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), or more recently direct preference optimization (DPO). However, existing objectives largely ignore the global geometry and topology of the representation space: they operate on local token-level likelihoods or scalar preference scores, and do not explicitly constrain how hidden states move from a user prompt to an answer. We view generation as tracing a \emph{semantic trajectory} in hidden space, and propose a topology-enhanced alignment framework that regularizes these trajectories using $0$-dimensional persistent homology. First, at the SFT stage, we introduce a \textbf{Trajectory Topology Loss} (TTL). For each batch, we treat mean-pooled embeddings of prompts and gold answers as a mixed point cloud, run a Union-Find-based $0$D persistent homology algorithm, and extract ``prompt--answer bridge'' edges that connect previously disconnected components. TTL encourages the model's actual update direction from prompt to answer to align with these topologically derived bridges, rather than with arbitrary or per-example directions. Second, at the RLHF/DPO stage, we propose \textbf{Topological Preference Optimization} (TPO). TPO constructs topic-specific semantic preference vectors from an offline pipeline and aligns the semantic improvement direction from rejected to chosen responses with these vectors in an intermediate hidden layer. We further introduce an exponential-moving-average-based dynamic weighting scheme to balance DPO and TPO losses, and also explore a fully topological variant that applies persistent homology on the chosen/rejected embedding cloud. We instantiate our methods on Qwen2.5-7B-Instruct and evaluate on UltraChat and Anthropic HH-RLHF.
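The TTL bridge extraction described above can be sketched as a Kruskal-style Union-Find pass over pairwise distances, where each merge of two previously disconnected components is a $0$D death event. This is an illustrative reconstruction, not the authors' implementation; in particular, the test for what counts as a prompt--answer bridge (one merged component contains a prompt, the other an answer) is a simplifying assumption.

```python
# Hedged sketch: 0D persistent homology on a mixed prompt/answer point
# cloud via Union-Find, recording merge edges that join a component
# containing at least one prompt to one containing at least one answer.
import numpy as np

def find(parent, x):
    # find with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def bridge_edges(points, is_prompt):
    """points: (n, d) array of mean-pooled embeddings;
    is_prompt: list of bools (True = prompt, False = answer).
    Returns (i, j, dist) merge edges that bridge prompt and answer sides."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    edges = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    has_prompt = list(is_prompt)
    has_answer = [not p for p in is_prompt]
    bridges = []
    for w, i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri == rj:
            continue  # cycle edge: no 0D component dies here
        # simplified bridge test: the merge connects a prompt-side
        # component to an answer-side component
        if (has_prompt[ri] and has_answer[rj]) or (has_prompt[rj] and has_answer[ri]):
            bridges.append((i, j, w))
        parent[ri] = rj
        has_prompt[rj] = has_prompt[rj] or has_prompt[ri]
        has_answer[rj] = has_answer[rj] or has_answer[ri]
    return bridges
```

On a toy cloud with a prompt cluster and an answer cluster, the single returned edge is the one spanning the gap between the clusters, which is the intuition behind aligning prompt-to-answer update directions with these bridges.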
Across both SFT and DPO training, topology-enhanced objectives consistently outperform strong non-topological baselines (including per-example, nearest-neighbor, and random direction regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or slightly reducing toxicity. These results suggest that incorporating persistent homology and trajectory geometry is a promising and practical direction for more controllable LLM alignment.
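An exponential-moving-average-based weighting scheme of the kind the abstract mentions could, for example, rescale the auxiliary TPO term so that its smoothed magnitude stays at a fixed fraction of the DPO loss. The class below is a hypothetical sketch: the name `EMALossBalancer`, the `base_weight` fraction, and the exact update rule are assumptions, not the paper's scheme.

```python
# Hedged sketch: balance a DPO loss and an auxiliary (topological) loss
# using exponential moving averages of their magnitudes.
class EMALossBalancer:
    def __init__(self, beta=0.9, base_weight=0.1):
        self.beta = beta                # EMA smoothing factor
        self.base_weight = base_weight  # target fraction of DPO loss
        self.ema_dpo = None
        self.ema_tpo = None

    def step(self, dpo_loss, tpo_loss):
        # update smoothed magnitudes of both losses
        if self.ema_dpo is None:
            self.ema_dpo, self.ema_tpo = dpo_loss, tpo_loss
        else:
            self.ema_dpo = self.beta * self.ema_dpo + (1 - self.beta) * dpo_loss
            self.ema_tpo = self.beta * self.ema_tpo + (1 - self.beta) * tpo_loss
        # scale the auxiliary term so its smoothed magnitude is a fixed
        # fraction of the smoothed DPO loss
        w = self.base_weight * self.ema_dpo / max(self.ema_tpo, 1e-8)
        return dpo_loss + w * tpo_loss
```

A dynamic weight of this shape keeps the auxiliary term from dominating early training (when topological losses can be large) without requiring a manually tuned fixed coefficient.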
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: fine-tuning, reinforcement learning, representation learning, robustness, probing
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 8441