BalancedBio: Mitigating the Alignment Tax in Biomedical LLMs via Gradient Orthogonality and GRPO

ACL ARR 2026 January Submission8908 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Multi-Objective Optimization, Biomedical Large Language Models
Abstract: Aligning Large Language Models (LLMs) for specialized domains presents a fundamental optimization challenge: the "alignment tax." In biomedicine, this tax manifests as a conflict between the need for rigorous, encyclopedic factual accuracy and the requirement for flexible, user-friendly instruction following. Existing methods, which rely primarily on Supervised Fine-Tuning (SFT) or standard Reinforcement Learning from Human Feedback (RLHF), often fail to navigate this Pareto frontier, yielding models that are either knowledgeable but rigid, or conversational but hallucination-prone. In this paper, we propose BalancedBio, a novel alignment framework that explicitly models and optimizes these conflicting objectives. While utilizing standard high-quality instruction data, our contribution lies in the algorithmic innovations: (1) we introduce Capability-Aware Group Relative Policy Optimization (GRPO), which eliminates the need for a value network and reduces gradient variance in high-entropy reasoning tasks; (2) we propose a Dynamic Hybrid Reward Mechanism that adaptively balances domain correctness, reasoning validity, and format compliance during training; and (3) we provide a theoretical analysis demonstrating how our method enforces gradient orthogonality to mitigate catastrophic forgetting. Extensive experiments on BIOMED-MMLU, MedQA, and IFEval show that BalancedBio-7B achieves state-of-the-art performance, surpassing Med-PaLM-7B by 6.3% on domain tasks while maintaining robust general instruction-following capabilities. We will release our model and partial data.
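The three components named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the advantage computation below is standard GRPO (group-mean baseline, no value network) rather than the Capability-Aware variant, the reward weights are fixed by assumption where the paper's mechanism is dynamic, and the orthogonality step is a generic PCGrad-style projection. All function names and weight values are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: normalize each sampled completion's reward
    against the mean and std of its own group, replacing a learned value
    network as the baseline. (Sketch; not the Capability-Aware variant.)"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def hybrid_reward(correct, valid_reasoning, format_ok, w=(0.6, 0.3, 0.1)):
    """Hypothetical fixed-weight blend of domain correctness, reasoning
    validity, and format compliance; the paper adapts the balance during
    training rather than fixing w."""
    return w[0] * correct + w[1] * valid_reasoning + w[2] * format_ok

def project_orthogonal(g_domain, g_general, eps=1e-12):
    """PCGrad-style sketch of gradient orthogonality: when the domain
    gradient conflicts with the general-capability gradient (negative dot
    product), remove the conflicting component so the domain update does
    not undo general instruction-following ability."""
    dot = float(np.dot(g_domain, g_general))
    if dot < 0.0:
        g_domain = g_domain - (dot / (float(np.dot(g_general, g_general)) + eps)) * g_general
    return g_domain
```

For example, a group of four sampled answers with rewards [1, 0, 0, 1] yields advantages [1, -1, -1, 1] without any critic, and a conflicting domain gradient comes out exactly orthogonal to the general gradient after projection.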
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling, Clinical and Biomedical Applications
Languages Studied: EN
Submission Number: 8908