COALA: Convex Optimization for Alignment and Preference Learning on a Single GPU

20 Sept 2025 (modified: 12 Feb 2026) | ICLR 2026 Conference Desk Rejected Submission | CC BY 4.0
Keywords: preference learning, fine-tuning LLMs, single GPU, resource-constrained, convex neural networks
TL;DR: Using a convex reformulation of neural networks for preference fine-tuning of LLMs on a single GPU.
Abstract: Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems like ChatGPT and Gemini. However, methods such as Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations, including inconsistent ranking accuracy, reliance on a frozen reference model, and heavy dependence on expensive GPU resources. We introduce COALA, a novel lightweight algorithm that leverages the convex optimization reformulation of neural networks (cvxNN). By exploiting the expressiveness of cvxNN, COALA eliminates the need for a reference model and significantly reduces both training time and VRAM consumption, enabling efficient training on a single GPU. Experiments across three datasets (including a 23,228-sample synthetic educational feedback dataset) and five models (including LLaMA-8B) demonstrate that COALA outperforms traditional preference alignment methods such as DPO and ORPO. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly less time than competing methods. To the best of our knowledge, this is the first successful application of convex optimization to preference fine-tuning of LLMs.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24477