RL-ADA: Co-Evolutionary Adversarial Training for Self-Improving Customer Support Agents Without Human Feedback
Keywords: reinforcement learning, adversarial training, co-evolutionary learning, task-oriented dialogue, world feedback, tool routing, asymmetric self-play, automated evaluation
TL;DR: RL-ADA co-evolves a Customer Support Agent and an adversarial Customer Agent through opposing reward signals, eliminating human annotation by grounding training entirely in measurable conversation outcomes.
Abstract: Task-oriented dialogue agents require robust tool routing to
handle unpredictable user interactions, but training them to
withstand adversarial inputs is bottlenecked by the high cost
of human preference annotation. Standard self-play methods are
poorly suited to the inherently asymmetric nature of
customer-agent dialogues. We present RL-ADA, a framework for
asymmetric co-evolutionary adversarial training that replaces
human labels with world feedback: consequence-based signals
derived directly from interaction outcomes. A Customer Support
Agent (DA, 3B parameters) and an Adversarial Customer Agent
(CA, 7B parameters) train against each other in an adversarial
arena guided by a fixed automated judge, with an isolation gym
selectively retraining the weaker agent on prior-failure
transcripts. The size asymmetry is intentional: the CA faces
the harder generative task of producing contextually misleading
natural language to elicit misroutes from a smaller,
task-focused DA. Training converges autonomously under a
win-rate stopping criterion, and the framework requires no
human annotation at any stage. We observe a non-monotonic
win-rate trajectory consistent with genuine co-evolutionary
pressure, and the emergence of adversarial Contextual
Camouflage: the CA learns to embed intent within dense,
realistic customer detail rather than expressing it directly,
a strategy that arises purely from reward pressure without
explicit specification.
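The control flow described in the abstract (opposing rewards from a fixed judge, retraining of the weaker agent on its prior failures, and a win-rate stopping rule) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the abstract does not specify the reward values, the stopping rule, or any function names, so the zero-sum +1/-1 rewards, the 0.45 to 0.55 balance band, the patience of 3 rounds, and every identifier here are assumptions made only for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    """One arena conversation plus the fixed judge's verdict (hypothetical schema)."""
    transcript: str
    misrouted: bool  # True if the judge found that the DA routed to the wrong tool


def opposing_rewards(misrouted: bool) -> Tuple[float, float]:
    """Return (da_reward, ca_reward). Zero-sum +1/-1 values are an assumption."""
    ca_reward = 1.0 if misrouted else -1.0
    return -ca_reward, ca_reward


def da_win_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the DA routed correctly."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if not e.misrouted) / len(episodes)


def weaker_agent(win_rate: float) -> str:
    """Pick the agent to send to the isolation gym (ties go to the CA here)."""
    return "DA" if win_rate < 0.5 else "CA"


def failure_transcripts(episodes: List[Episode], agent: str) -> List[str]:
    """Transcripts the given agent lost: misroutes for the DA, correct routes for the CA."""
    lost_when_misrouted = agent == "DA"
    return [e.transcript for e in episodes if e.misrouted == lost_when_misrouted]


def should_stop(history: List[float], low: float = 0.45,
                high: float = 0.55, patience: int = 3) -> bool:
    """Assumed stopping rule: DA win rate stays inside [low, high] for `patience` rounds."""
    recent = history[-patience:]
    return len(recent) == patience and all(low <= w <= high for w in recent)
```

In a full training loop, each round would run a batch of arena episodes, apply `opposing_rewards` to update both agents, append `da_win_rate` to the history, retrain `weaker_agent` on its `failure_transcripts`, and exit once `should_stop` returns true.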
Submission Number: 19