Teaching LLMs to Be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
Keywords: reinforcement learning, LLM alignment, post-training, controllable text generation
Abstract: We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guardrails (no over-promising and no hallucinations), while remaining human-like and effective over long, multi-turn dialogues.
We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method that combines heterogeneous rewards: a preference-trained reward model (RM), an LLM-as-a-judge (RJ) for nuanced behaviors (e.g., emotional value and SOP compliance), and rule-based reward functions (RF), mainly regex checks, for deterministic verification of numerics, formatting, and guardrails. In an expert consensus evaluation (three human experts; 30 online conversations and 45 curated bad cases), REPO improves the average dialogue rating to 4.63 (+0.33 over GRPO) and raises the share of conversations with at least one excellent response to 66.67% (+23.34 pp over GRPO), while achieving a 93.33% bad-case fix rate with 75.56% clean fixes.
In a production A/B test on 9,653 real customer conversations (vs. an intent-driven dialogue system), REPO improves response rate by +12.14 pp and task success rate by +5.94 pp (p < 0.001).
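The abstract does not specify how the three reward signals are aggregated. Below is a minimal Python sketch of one plausible scheme: a weighted blend of the RM and RJ scores, gated by a regex-based rule check for guardrail violations. The weights (W_RM, W_RJ, W_RF), the guardrail patterns, and all function names are hypothetical illustrations, not the paper's actual implementation.

```python
import re

# Hypothetical blend weights; the paper does not disclose its aggregation scheme.
W_RM, W_RJ, W_RF = 0.5, 0.3, 0.2

# Illustrative guardrail patterns for over-promising; real deployments would
# use a vetted, domain-specific pattern set.
GUARDRAIL_PATTERNS = [
    re.compile(r"guaranteed?\s+refund", re.IGNORECASE),
    re.compile(r"\d+%\s*off", re.IGNORECASE),
]

def rule_based_reward(response: str) -> float:
    """Deterministic regex check (RF): any guardrail violation zeroes the signal."""
    for pattern in GUARDRAIL_PATTERNS:
        if pattern.search(response):
            return 0.0
    return 1.0

def combined_reward(rm_score: float, rj_score: float, response: str) -> float:
    """Blend preference RM, LLM-judge (RJ), and rule-based (RF) rewards into one scalar."""
    rf_score = rule_based_reward(response)
    return W_RM * rm_score + W_RJ * rj_score + W_RF * rf_score

# Example: a compliant response keeps its blended reward;
# a response matching a guardrail pattern loses the RF component.
print(combined_reward(0.8, 0.9, "We can offer a flexible rate for your dates."))
print(combined_reward(0.8, 0.9, "Book now and get a guaranteed refund!"))
```

Weighting the deterministic RF term (rather than hard-rejecting violating samples) is one design choice among several; a stricter variant would multiply the blended score by the RF gate so that any guardrail violation zeroes the entire reward.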
Submission Type: Deployed
Copyright Form: pdf
Submission Number: 117