Teaching LLMs to Be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
Keywords: reinforcement learning, LLM alignment, post-training, controllable text generation
Abstract: We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guardrails (no over-promising and no hallucinations), while remaining human-like and effective over long, multi-turn dialogues.
We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training method that combines heterogeneous rewards: a preference-trained reward model (RM), an LLM-as-a-judge (RJ) for nuanced behaviors (e.g., emotional value and SOP compliance), and rule-based reward functions (RF), mainly regex checks, for deterministic verification of numerics, formatting, and guardrails. In an expert consensus evaluation (three human experts; 30 online conversations and 45 curated bad cases), REPO improves the average dialogue rating to 4.63 (+0.33 over GRPO) and raises the share of conversations with at least one excellent response to 66.67% (+23.34 pp over GRPO), while achieving a 93.33% bad-case fix rate with 75.56% clean fixes.
In a production A/B test on 9,653 real customer conversations (vs. an intent-driven dialogue system), REPO improves response rate by +12.14 pp and task success rate by +5.94 pp (p < 0.001).
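The abstract does not specify how the three reward signals are aggregated. Below is a minimal Python sketch of one plausible scheme: a weighted blend of the RM and RJ scores, gated by a regex-based rule check for guardrail violations. The weights (W_RM, W_RJ, W_RF), the guardrail patterns, and all function names are hypothetical illustrations, not the paper's actual implementation.

```python
import re

# Hypothetical blend weights; the paper does not disclose its aggregation scheme.
W_RM, W_RJ, W_RF = 0.5, 0.3, 0.2

# Illustrative guardrail patterns for over-promising; real deployments would
# use a vetted, domain-specific pattern set.
GUARDRAIL_PATTERNS = [
    re.compile(r"guaranteed?\s+refund", re.IGNORECASE),
    re.compile(r"\d+%\s*off", re.IGNORECASE),
]

def rule_based_reward(response: str) -> float:
    """Deterministic regex check (RF): any guardrail violation zeroes the signal."""
    for pattern in GUARDRAIL_PATTERNS:
        if pattern.search(response):
            return 0.0
    return 1.0

def combined_reward(rm_score: float, rj_score: float, response: str) -> float:
    """Blend preference RM, LLM-judge (RJ), and rule-based (RF) rewards into one scalar."""
    rf_score = rule_based_reward(response)
    return W_RM * rm_score + W_RJ * rj_score + W_RF * rf_score

# Example: a compliant response keeps its blended reward;
# a response matching a guardrail pattern loses the RF component.
print(combined_reward(0.8, 0.9, "We can offer a flexible rate for your dates."))
print(combined_reward(0.8, 0.9, "Book now and get a guaranteed refund!"))
```

Weighting the deterministic RF term (rather than hard-rejecting violating samples) is one design choice among several; a stricter variant would multiply the blended score by the RF gate so that any guardrail violation zeroes the entire reward.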
Submission Type: Deployed
Copyright Form: pdf
Submission Number: 117