Policy optimization to align the validity, coherence and efficiency of reasoning agents in multi-turn dialogues

Published: 22 Oct 2024 · Last Modified: 30 Oct 2024 · NeurIPS 2024 Workshop Open-World Agents (Poster) · CC BY 4.0
Keywords: Large language models, fine-tuning, policy optimization, multi-turn dialogues, generative retrieval, reasoning-acting agents
TL;DR: This paper proposes a method to improve the fidelity and efficiency of language models in multi-turn dialogues via reinforcement learning policy optimization with a deterministic reward model and adversarial prompt generation.
Abstract: Reinforcement learning from human preferences can fine-tune language models for helpfulness and safety, but it does not directly address the fidelity and efficiency of reasoning agents in multi-turn dialogues. I propose a method to improve the validity, coherence and efficiency of reasoning agents by defining a reward model as a deterministic mapping between predefined queries and tools, which can be applied to any custom orchestration environment. The reward model is used for policy optimization to fine-tune the agent's clarification-fallback behavior and help the agent learn when it is best to ask for clarification in multi-turn dialogues. This is demonstrated in several orchestration environments: after fine-tuning with either proximal policy optimization or verbal reinforcement, the new policy systematically identifies the correct intents and tools in fewer than two steps in over 99% of all sampled dialogues.
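A minimal sketch of what a deterministic query-to-tool reward model of the kind the abstract describes might look like. This is not code from the paper; the mapping `QUERY_TO_TOOL`, the `Turn` structure, the `reward` function, and the step penalty are all hypothetical placeholders, assuming the reward combines tool correctness with a per-step cost that also covers clarification turns.

```python
# Hypothetical sketch (not from the paper): a deterministic reward model that
# maps predefined queries to their expected tools and penalizes extra steps.
from dataclasses import dataclass

# Assumed predefined mapping from user queries/intents to the resolving tool.
QUERY_TO_TOOL = {
    "What is the weather in Paris tomorrow?": "weather_api",
    "Book a table for two at 7pm": "reservations_api",
    "Cancel my last order": "orders_api",
}

@dataclass
class Turn:
    """One agent action in a multi-turn dialogue."""
    query: str                # the (possibly clarified) user query at this turn
    tool_called: str | None   # tool invoked, or None if the agent asked a clarification

def reward(dialogue: list[Turn], step_penalty: float = 0.1) -> float:
    """+1 if the final tool matches the expected tool for the final query,
    -1 otherwise, minus a small penalty per extra step (including clarification
    turns), so the agent is pushed to resolve the intent in as few steps as possible."""
    if not dialogue:
        return 0.0
    final = dialogue[-1]
    expected = QUERY_TO_TOOL.get(final.query)
    correct = 1.0 if expected is not None and final.tool_called == expected else -1.0
    return correct - step_penalty * (len(dialogue) - 1)

# Example: correct tool found in two steps (one clarification, then the call).
episode = [
    Turn("Book a table", tool_called=None),                               # clarification asked
    Turn("Book a table for two at 7pm", tool_called="reservations_api"),  # correct tool
]
print(reward(episode))  # 1.0 - 0.1 = 0.9
```

Under these assumptions, the scalar returned by `reward` could serve as the episode return for PPO or as feedback in a verbal-reinforcement loop; the actual reward shaping in the paper may differ.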
Submission Number: 52