Keywords: prompt optimization, naturally logged data, contextual bandits, off-policy learning
TL;DR: We study how to use naturally available user feedback for prompt optimization, leveraging similarity among generated sentences.
Abstract: We study how to use naturally available user feedback, such as clicks, to optimize a prompt policy for generating sentences with large language models (LLMs). Naive approaches, including regression-based and importance-sampling-based ones, suffer either from bias due to the logged data or from high variance caused by the large action space of prompts. To circumvent these challenges, we propose leveraging similarity and smoothness in the embedding space of generated sentences, substantially reducing the variance of the policy gradient while keeping its bias small. Initial experiments on synthetic data demonstrate the effectiveness of our approach. We also plan to publish the extended benchmark and simulator as open-source software.
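To make the variance-reduction idea concrete, below is a minimal NumPy sketch of kernel smoothing over sentence embeddings for importance weighting. It is an illustration under stated assumptions, not the paper's actual estimator: the toy data, the `gaussian_kernel` and `smoothed_weights` helpers, and the fixed one-embedding-per-prompt setup are all hypothetical. The point is only that pooling probability mass across prompts whose sentences are similar tames the vanilla importance weights when the prompt space is large.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical logged bandit data (all names/values are illustrative) ---
n, n_actions, d = 1000, 50, 8                 # logged rounds, candidate prompts, embedding dim
sent_emb = rng.normal(size=(n_actions, d))    # toy: one fixed sentence embedding per prompt
pi_0 = rng.dirichlet(np.ones(n_actions))      # logging policy over prompts
actions = rng.choice(n_actions, size=n, p=pi_0)
rewards = rng.binomial(1, 0.3, size=n)        # naturally logged feedback, e.g., clicks

def gaussian_kernel(e_a, e_b, bw=1.0):
    """RBF similarity between sentence embeddings (broadcasts over leading dims)."""
    return np.exp(-np.sum((e_a - e_b) ** 2, axis=-1) / (2 * bw**2))

def smoothed_weights(pi_target, bw=1.0):
    """Kernel-smoothed importance weights in the sentence-embedding space.

    Vanilla IPS uses pi_target[a_i] / pi_0[a_i], which has huge variance when
    the action (prompt) space is large. Here each logged sentence instead
    pools mass from *all* prompts whose sentences are similar to it, under
    both the target and logging policies, yielding a smoothed density ratio.
    """
    # K[i, a] = similarity between logged sentence i and the sentence of prompt a
    K = gaussian_kernel(sent_emb[None, :, :], sent_emb[actions][:, None, :], bw)
    target_density = K @ pi_target                    # smoothed target density at sentence i
    logging_density = K @ pi_0                        # smoothed logging density at sentence i
    return target_density / np.clip(logging_density, 1e-8, None)

# Compare vanilla IPS weights against the smoothed ones for a random target policy.
pi_t = rng.dirichlet(np.ones(n_actions))
w_ips = pi_t[actions] / pi_0[actions]
w_smooth = smoothed_weights(pi_t, bw=1.0)
print("IPS      value estimate:", np.mean(w_ips * rewards), " variance:", np.var(w_ips * rewards))
print("Smoothed value estimate:", np.mean(w_smooth * rewards), " variance:", np.var(w_smooth * rewards))
```

In a policy-gradient setting, the same smoothed weights would multiply the score function in place of the per-prompt importance ratio; the bandwidth `bw` then controls the bias-variance trade-off the abstract alludes to (small `bw` recovers vanilla IPS, large `bw` smooths more aggressively at the cost of bias).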
Submission Number: 5