Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Models, Natural Language to First-Order Logic Translation
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We collect and verify a large NL-FOL pair dataset from GPT-4; we fine-tune LLaMA2-7B/13B models on this dataset, which yields GPT-4-level performance.
Abstract: Translating natural language sentences to first-order logic (NL-FOL translation) remains a critical task in many logic-based NLP systems, as it enables ML models to reason logically over text. However, existing translation methods still struggle to scale to real-world tasks due to the lack of a large, high-quality dataset and a model family with high precision and coverage. In this work, we approach this longstanding challenge by harnessing the power of pre-trained large language models (LLMs). To do so, we present MALLS (large language **M**odel gener**A**ted N**L**-FO**L** pair**S**), a dataset of 28K diverse and verified sentence-level NL-FOL pairs collected from GPT-4. We create MALLS with an adaptive pipeline that prompts GPT-4 for pairs with rich and diverse contexts. To ensure the validity of the FOL rules and their alignment with the NL sentences, we use a combined strategy of FOL rule parsing, human annotation, and automatic filtering. We also present LogicLLaMA, a LLaMA2-7B/13B model family fine-tuned on MALLS for NL-FOL translation. LogicLLaMA directly translates natural language into FOL rules and outperforms GPT-3.5 on this task. LogicLLaMA can also correct FOL rules predicted by GPT-3.5, achieving performance comparable to GPT-4 at a fraction of the cost. This correction ability is achieved by a novel reinforcement learning with human feedback (RLHF) framework, which first trains on synthetically perturbed NL-FOL pairs to encourage chain-of-thought reasoning and then fine-tunes with RLHF on GPT-3.5 outputs using an FOL verifier as the reward model. Code and data are available [here](https://www.dropbox.com/sh/t0f69776773e9er/AABKaWuvepUvhSp-0u2w-b2Pa?dl=0).
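To make the data and reward setup concrete, below is a minimal sketch (not the authors' released code) of what a MALLS-style NL-FOL pair might look like, together with a toy syntactic "verifier" that returns 1.0 only when an FOL string tokenizes cleanly with balanced parentheses. The pair content, token syntax, and the `toy_fol_reward` function are illustrative assumptions; the actual dataset format and reward model are described in the paper and released code.

```python
import re

# Hypothetical example of a sentence-level NL-FOL pair (illustrative only).
example_pair = {
    "NL": "Every student who studies hard passes the exam.",
    "FOL": "forall x ((Student(x) and StudiesHard(x)) -> PassesExam(x))",
}

# One token per step: quantifiers, connectives, identifiers, and punctuation.
FOL_TOKEN = re.compile(r"\s*(forall|exists|and|or|not|->|<->|[A-Za-z_]\w*|[(),])")

def toy_fol_reward(fol: str) -> float:
    """Return 1.0 if the string tokenizes cleanly and parentheses balance,
    else 0.0 -- a stand-in for a real FOL parser used as a reward signal."""
    fol = fol.strip()
    pos, depth = 0, 0
    while pos < len(fol):
        m = FOL_TOKEN.match(fol, pos)
        if not m:
            return 0.0          # unparseable fragment
        tok = m.group(1)
        depth += (tok == "(") - (tok == ")")
        if depth < 0:
            return 0.0          # closing parenthesis with no matching open
        pos = m.end()
    return 1.0 if depth == 0 else 0.0

print(toy_fol_reward(example_pair["FOL"]))  # -> 1.0
```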
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3948