Generate-Feedback-Refine: How Much Does Model Quality in Each Role Matter?

Published: 06 Mar 2025, Last Modified: 06 Mar 2025 · DL4C @ ICLR 2025 · CC BY 4.0
Track: long paper (up to 9 pages)
Keywords: feedback, refinement, scalable oversight
TL;DR: We evaluate models of different strengths in each role in the Generate-Feedback-Refine pipeline
Abstract: From early in grade school, people learn from explicit feedback provided in response to assignments or other interactions. In this work, we study how effectively language models incorporate textual feedback, focusing on the utility of having weaker models provide feedback to stronger ones, a potential pathway to scalable oversight. Using code generation as a test domain, we experimentally investigate a generate-feedback-refine process, varying model strengths for generation, feedback, and refinement across the MBPP, APPS, and DS-1000 datasets. We find that weaker models can provide feedback as effectively as stronger models in some cases. Feedback-and-refinement consistently improves performance on APPS and DS-1000, while on MBPP, feedback mainly benefits weaker generation models, underscoring differences across tasks.
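As an illustration of the pipeline the abstract describes, the following is a minimal sketch of one generate-feedback-refine round. The function names and the `complete(model, prompt)` interface are hypothetical placeholders for illustration, not the paper's implementation; the three model arguments may be of different strengths, which is the variable the paper studies.

```python
# Minimal sketch of one generate-feedback-refine round (illustrative only;
# the `complete` API and model-name arguments are hypothetical placeholders).

def generate_feedback_refine(problem: str, complete, generator: str,
                             feedback_model: str, refiner: str) -> str:
    """Generate a candidate solution, critique it, then refine it.

    `complete(model, prompt)` is assumed to return the model's text output.
    """
    # Step 1: the generation model drafts a solution to the problem.
    draft = complete(generator, f"Write a Python solution for:\n{problem}")

    # Step 2: the feedback model critiques the draft in natural language.
    feedback = complete(
        feedback_model,
        f"Problem:\n{problem}\n\nCandidate solution:\n{draft}\n\n"
        "Point out any bugs or issues in this solution."
    )

    # Step 3: the refinement model revises the draft using the feedback.
    refined = complete(
        refiner,
        f"Problem:\n{problem}\n\nCandidate solution:\n{draft}\n\n"
        f"Feedback:\n{feedback}\n\nRewrite the solution, addressing the feedback."
    )
    return refined
```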
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Jason_Phang1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 51