RLSF: Reinforcement Learning via Symbolic Feedback

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Symbolic Feedback, Reinforcement Learning, Large Language Models, Program Synthesis
TL;DR: RLSF is a new fine-tuning paradigm that improves domain-specific understanding in LLMs by using symbolic tools for fine-grained feedback, surpassing traditional methods and enabling smaller models to outperform much larger closed-source models.
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a standard approach to fine-tuning Large Language Models (LLMs), but it faces challenges such as unsound reward models, sparse rewards, and the difficulty of collecting human preference data, which limit its effectiveness for complex, domain-specific tasks. We propose Reinforcement Learning via Symbolic Feedback (RLSF), a novel fine-tuning paradigm in which reasoning tools (e.g., solvers, provers, algebra systems) serve as the RL environment and provide fine-grained feedback via \textit{poly-sized certificates} (e.g., proofs) that characterize errors in the LLM-generated object with respect to specific correctness criteria. RLSF aims to improve the domain-specific understanding of LLMs more effectively than traditional reward signals. By enabling token-level corrections without requiring differentiable reasoning systems, RLSF addresses key limitations of traditional reward models. Via extensive evaluations, we show that RLSF-based fine-tuning of LLMs outperforms traditional approaches on five applications that have associated logical or domain constraints: program synthesis from natural-language pseudo-code to a programming language (+31.43\% in functional correctness for Google's CodeGemma-2b compared to supervised fine-tuning, and +17.01\% compared to GPT-3.5, which is 100$\boldsymbol\times$ larger); three chemistry tasks (+5.5\% exact match for molecule generation, +19.4\% for forward synthesis, and +33.7\% for retrosynthesis, using Meta's Galactica-1.3b, compared to GPT-4, which is 1000$\boldsymbol\times$ larger); and solving the Game of 24 (+25\% success rate using Meta's Llama2-7b compared to traditional methods, and +7\% compared to GPT-3.5, which is 25$\boldsymbol\times$ larger). A key takeaway is that fine-tuning via RLSF enables relatively small LLMs to significantly outperform closed-source models that are orders of magnitude larger (e.g., GPT-4).
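
To make the feedback mechanism concrete, below is a minimal illustrative sketch (not the authors' implementation) of how a symbolic tool's certificate might be turned into a token-level reward signal of the kind RLSF describes. Here Python's own compiler stands in for the reasoning tool, its SyntaxError plays the role of a small certificate that localizes the fault, and the function name symbolic_reward and the specific reward values are hypothetical choices for illustration only.

    from typing import List

    def symbolic_reward(program: str, tokens: List[str]) -> List[float]:
        """Return one reward per generated token.

        On success every token gets +1; on failure, tokens on the offending
        line (as reported by the certificate) get -1 and the rest get 0.
        """
        try:
            compile(program, "<llm-output>", "exec")       # symbolic check
            return [1.0] * len(tokens)                     # fully correct output
        except SyntaxError as cert:                        # certificate localizing the error
            bad_line = cert.lineno or 1
            rewards, line = [], 1
            for tok in tokens:
                rewards.append(-1.0 if line == bad_line else 0.0)
                line += tok.count("\n")                    # track which line each token is on
            return rewards

    # Hypothetical usage with a tokenized LLM completion:
    prog = "def add(a, b):\n    return a +\n"
    toks = ["def add(a, b):\n", "    return a +", "\n"]
    print(symbolic_reward(prog, toks))   # penalizes only tokens on the broken line

In the paper's setting, the compiler would be replaced by a domain-appropriate solver, prover, or algebra system, and the resulting per-token rewards would drive an RL fine-tuning loop; the sketch only shows how fine-grained, non-differentiable feedback can be mapped onto the generated tokens.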
Supplementary Material: zip
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8060