Keywords: learning from textual feedback, large language model, instruction finetuning, reinforcement learning from human feedback
Abstract: Fine-tuning pre-trained language models (LMs) enhances the models' capabilities.
Prior techniques fine-tune a pre-trained LM on input-output pairs (e.g., instruction fine-tuning) or with numerical rewards that gauge the quality of its outputs (e.g., reinforcement learning from human feedback).
We explore LMs' potential to learn from textual interactions (LETI) that not only check the correctness of their outputs with binary labels but also pinpoint and explain errors in those outputs through textual feedback.
Our investigation focuses on the code generation task, where the model produces code in response to natural language instructions.
This setting invites a natural and scalable way to acquire textual feedback: the error messages and stack traces from code execution using a Python interpreter.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback, which is only provided when the generated program fails to solve the task.
A binary reward token is prepended to this fine-tuning text to differentiate correct solutions from buggy ones.
LETI requires no ground-truth outputs for training and even outperforms a fine-tuned baseline that does.
LETI not only improves the performance of two base LMs of different scales on MBPP, a code generation dataset, but also generalizes to other datasets.
Trained on MBPP, it achieves comparable or better performance than the base LMs on unseen problems in HumanEval.
Furthermore, we observe that textual feedback, compared to binary feedback, leads to improved generation quality and sample efficiency, achieving the same performance with fewer than half of the gradient steps.
LETI is equally applicable to natural language tasks that can be formulated as code generation, which we empirically verify on event argument extraction.
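As a rough illustration of the data construction described in the abstract, the sketch below runs a generated program against test code in a Python interpreter, captures the error message and stack trace as textual feedback, and assembles one fine-tuning example with a reward token in front. It is a minimal sketch under stated assumptions, not the authors' implementation: the token strings `<|good|>` and `<|bad|>`, the function names, the newline separator, and the timeout are all hypothetical, and the fine-tuning step itself is omitted.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical reward-token strings; the abstract does not specify them.
GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


def run_with_feedback(program: str, test_code: str, timeout: float = 10.0):
    """Run a candidate program plus its tests in a fresh Python interpreter.

    Returns (passed, feedback). The feedback is the captured stderr
    (error message and stack trace) and is empty when the run succeeds.
    """
    source = program + "\n\n" + test_code + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(source)
        path = handle.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "TimeoutError: execution exceeded the time limit"
    finally:
        os.unlink(path)  # clean up the temporary script in every case
    passed = result.returncode == 0
    feedback = "" if passed else result.stderr.strip()
    return passed, feedback


def build_training_text(instruction: str, program: str, passed: bool, feedback: str) -> str:
    """Concatenate reward token, instruction, program, and (on failure) feedback."""
    parts = [GOOD_TOKEN if passed else BAD_TOKEN, instruction, program]
    if not passed:
        # Textual feedback is only included when the program fails.
        parts.append(feedback)
    return "\n".join(parts)


if __name__ == "__main__":
    instruction = "Write a function add(a, b) that returns the sum of a and b."
    buggy_program = "def add(a, b):\n    return a - b"
    tests = "assert add(2, 3) == 5"
    passed, feedback = run_with_feedback(buggy_program, tests)
    print(build_training_text(instruction, buggy_program, passed, feedback))
```

Running each candidate in a separate interpreter process keeps a crashing or non-terminating program from affecting the data-collection loop, and the failure case needs no ground-truth solution, only test code that can be executed.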
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3930