Keywords: Mechanism Design, RLHF, Game Theory, Incentive Compatibility
Abstract: Fine-tuning large language models (LLMs) to aggregate multiple preferences has attracted considerable research attention.
As aggregation algorithms advance, a natural economic scenario emerges in which fine-tuning services are provided to agents with differing preferences.
In this context, agents may benefit from strategically misreporting their preferences, which could affect the fine-tuned outcomes.
This paper addresses these incentive issues by framing them as a mechanism design problem: an LLM provider determines the fine-tuning objective (training rule) and the pricing scheme (payment rule) for agents.
We focus primarily on a representative class of training rules, referred to as \tr\ rules, that maximize social welfare subject to certain regularization terms.
First, we show that under most circumstances, truthful reporting is suboptimal under a training rule alone, which highlights the necessity of payments.
Second, we design affine maximizer payment rules that implement \tr\ rules in dominant-strategy incentive compatibility (DSIC).
Further, we characterize sufficient conditions for payment equivalence.
For a training rule that satisfies these conditions, we characterize all payment rules that implement it in DSIC: any two such rules differ only by a constant term that does not depend on agents' reports.
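To give a sense of the mechanism's form, the sketch below shows a generic affine maximizer training rule together with a weighted-VCG-style payment; the notation (weights $w_j > 0$, report-independent boost $b$, reported valuations $\hat v_j$, and fine-tuned model parameters $\theta$) is assumed for illustration and is not taken from the paper:
$$\theta^{*}(\hat v) \in \arg\max_{\theta}\; \sum_{j} w_j \hat v_j(\theta) + b(\theta), \qquad
p_i(\hat v) = \frac{1}{w_i}\Bigg[\max_{\theta}\Big(\sum_{j \neq i} w_j \hat v_j(\theta) + b(\theta)\Big) - \Big(\sum_{j \neq i} w_j \hat v_j\big(\theta^{*}(\hat v)\big) + b\big(\theta^{*}(\hat v)\big)\Big)\Bigg].$$
Under such a payment, each agent is charged a rescaled version of the externality its report imposes on the others, the standard route to DSIC for affine maximizers; in the fine-tuning setting, the regularization in the \tr\ objective would play a role analogous to $b$ under this view.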
Submission Number: 53