RL-ADA: Co-Evolutionary Adversarial Training for Self-Improving Customer Support Agents Without Human Feedback

Ram Narayanan Ananthakrishnapuram Sampath; Harshit Rajgarhia; Abhishek Mukherji

Published: 23 May 2026, Last Modified: 23 May 2026

Venue: ACM CAIS 2026: RLEval Workshop Oral

License: CC BY 4.0

Keywords: reinforcement learning, adversarial training, co-evolutionary learning, task-oriented dialogue, world feedback, tool routing, asymmetric self-play, automated evaluation

TL;DR: RL-ADA co-evolves a Customer Support Agent and an adversarial Customer Agent through opposing reward signals, eliminating human annotation by grounding training entirely in measurable conversation outcomes.

Abstract: Task-oriented dialogue agents require robust tool routing to handle unpredictable user interactions, but training them to withstand adversarial inputs is bottlenecked by the high cost of human preference annotation. Standard self-play methods are poorly suited for the inherently asymmetric nature of customer-agent dialogues. We present RL-ADA, a framework for asymmetric co-evolutionary adversarial training that replaces human labels with world feedback: consequence-based signals derived directly from interaction outcomes. A Customer Support Agent (DA, 3B parameters) and an Adversarial Customer Agent (CA, 7B parameters) train against each other in an adversarial arena guided by a fixed automated judge, with an isolation gym selectively retraining the weaker agent on prior-failure transcripts. The size asymmetry is intentional: the CA faces the harder generative task of producing contextually misleading natural language to elicit misroutes from a smaller, task-focused DA. Training converges autonomously under a win-rate stopping criterion, with the framework requiring no human annotation at any stage. We observe a non-monotonic win-rate trajectory consistent with genuine co-evolutionary pressure, and the emergence of adversarial Contextual Camouflage: the CA learns to embed intent within dense, realistic customer detail rather than expressing it directly, a strategy that arises purely from reward pressure without explicit specification.
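The training loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no reward values, thresholds, or interfaces, so the zero-sum reward of ±1, the balance band of 0.45 to 0.55 used as the win-rate stopping criterion, and the names `Episode`, `opposing_rewards`, `run_arena`, `retrain`, and `co_evolve` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Episode:
    """One arena conversation plus the fixed judge's verdict (hypothetical type)."""
    transcript: str
    da_won: bool  # True if the judge ruled that the support agent (DA) routed correctly


def opposing_rewards(da_won: bool) -> Tuple[float, float]:
    """Return (DA reward, CA reward). The zero-sum +/-1 values are an assumption."""
    return (1.0, -1.0) if da_won else (-1.0, 1.0)


def win_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes the DA won."""
    return sum(e.da_won for e in episodes) / len(episodes)


def co_evolve(
    run_arena: Callable[[], List[Episode]],
    retrain: Callable[[str, List[str]], None],
    band: Tuple[float, float] = (0.45, 0.55),  # assumed stopping band, not from the paper
    max_rounds: int = 20,
) -> List[float]:
    """Alternate arena rounds with isolation-gym retraining of the weaker agent.

    Stops when the DA win rate falls inside `band` or after `max_rounds`.
    Returns the DA win rate observed in each round.
    """
    low, high = band
    history: List[float] = []
    for _ in range(max_rounds):
        episodes = run_arena()
        rate = win_rate(episodes)
        history.append(rate)
        if low <= rate <= high:
            break  # agents are roughly balanced: stop training
        if rate < low:
            # DA is weaker: retrain it on the transcripts it lost
            weaker = "DA"
            failures = [e.transcript for e in episodes if not e.da_won]
        else:
            # CA is weaker: retrain it on the transcripts where it failed to mislead the DA
            weaker = "CA"
            failures = [e.transcript for e in episodes if e.da_won]
        retrain(weaker, failures)
    return history
```

In this sketch, `run_arena` would wrap the adversarial arena and the fixed automated judge, and `retrain` would wrap the isolation gym. Both are passed in as callables so the control flow can be exercised with stubs before any model is attached.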

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 19
