Abstract: Language model training and alignment rely on high-quality human feedback, yet platforms must incentivize valuable contributions while limiting harmful feedback from non-experts. We study a screening environment in which a platform commits to a uniform policy $(\rho,R,P)$---a verification rate, a reward for submitting feedback, and a sanction imposed when verified feedback is harmful---and heterogeneous users decide whether to participate. High-type users are more likely to produce helpful feedback, while low-type users are more likely to generate harmful feedback, and user types may differ in their effective exposure to sanctions.
We characterize platform-optimal verification, reward, and sanction policies under costly verification in robust pure-participation regimes. A key boundary condition, $\phi_H(1-\eta_H)=\phi_L(1-\eta_L)$, separates parameter regions in which incentives implement normal separation, where high types participate and low types abstain, from regions exhibiting reverse screening, where low types participate while high types are deterred. Verification is the primary policy margin: it improves the value of screened feedback and enables sanctions, but it is costly because aggregate verification costs are convex in the mass of verified feedback. Rewards are pinned down by participation constraints, while sanctions are useful only when their expected collections exceed the induced increase in reward compensation and are limited by enforcement and reputational costs.
We further show that, under optimal verification, platform profit need not increase monotonically with population quality. In our numerical illustration, this non-monotonicity appears as an inverted-U pattern, with profit peaking at intermediate shares of high-type users. The mechanism is that a larger high-type population changes the platform's optimal verification intensity, and the resulting adjustment in verification benefits and costs can make additional high-type participation less profitable at the margin. Finally, we provide an illustrative simulation using a bigram language model as a transparent calibration exercise to generate plausible magnitudes for $(\eta_H,\eta_L)$ and to visualize the model's comparative statics.
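As a rough illustration of the boundary condition $\phi_H(1-\eta_H)=\phi_L(1-\eta_L)$, the sketch below classifies a parameter point by comparing the two sides. It rests on assumptions the abstract does not state: that $\eta_\theta$ is the probability that type $\theta$ produces helpful feedback, that $\phi_\theta$ scales that type's exposure to sanctions, and that both types otherwise face the same participation payoff, so the type with the smaller $\phi_\theta(1-\eta_\theta)$ is the one a sanction deters less. The function name `screening_regime` and the tolerance `tol` are hypothetical.

```python
def screening_regime(phi_H, eta_H, phi_L, eta_L, tol=1e-12):
    """Classify a parameter point relative to the boundary condition.

    Assumption (not from the source): phi * (1 - eta) is a type's
    effective exposure to sanctions per verified submission.
    """
    exposure_H = phi_H * (1.0 - eta_H)  # high type's sanction exposure
    exposure_L = phi_L * (1.0 - eta_L)  # low type's sanction exposure

    if abs(exposure_H - exposure_L) <= tol:
        return "boundary"  # phi_H(1-eta_H) = phi_L(1-eta_L)
    # Assumed direction: sanctions bite harder on the more exposed type,
    # so lower high-type exposure permits normal separation.
    return "normal" if exposure_H < exposure_L else "reverse"


if __name__ == "__main__":
    # Equal exposure weights: the low type is more exposed.
    print(screening_regime(1.0, 0.9, 1.0, 0.4))
    # Low type largely shielded from sanctions: the ordering flips.
    print(screening_regime(1.0, 0.9, 0.1, 0.4))
```

The second call shows how a sufficiently small $\phi_L$ can reverse the ordering even when the high type is more likely to be helpful, which is the situation the abstract calls reverse screening.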
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=p14z2pffFl&noteId=p14z2pffFl
Changes Since Last Submission: Revise the author metadata
Assigned Action Editor: ~Dileep_Kalathil1
Submission Number: 9304