Mitigating Feedback Inconsistency in Large Language Model Self-Tuning

Anonymous

16 Oct 2023 · ACL ARR 2023 October Blind Submission · Readers: Everyone
Abstract: Large language models have demonstrated various abilities, e.g., Chain-of-Thought reasoning on math reasoning datasets. Can models learn to self-improve these skills? First, we statistically analyze the potential of the self-evaluation ability of language models. Then, we present a novel self-tuning framework, STC, that leverages reinforcement learning to enhance reasoning capabilities in large language models. STC encourages the generation of logical explanations by evaluating the greedy-decoded response against diverse sampled responses. Results highlight the effectiveness of our framework across various model sizes (1B-20B). We observe accuracy improvements of up to 5% on four math reasoning datasets, while simultaneously improving commonsense ability and retaining language understanding ability. Additionally, human and machine evaluations confirm that the generated responses become more detailed and logical after training.
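The abstract does not spell out how the greedy-decoded response is scored against the sampled ones. A minimal sketch of one plausible reading, assuming a self-consistency-style reward (the fraction of diverse samples whose final answer agrees with the greedy answer), is shown below; the `generate` callable and the `extract_answer` helper are illustrative placeholders, not the authors' actual API.

```python
import re
from collections import Counter
from typing import Callable

def extract_answer(response: str) -> str:
    """Pull the final number from a chain-of-thought response (illustrative)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else ""

def consistency_reward(
    generate: Callable[[str, bool], str],  # (prompt, do_sample) -> response text
    prompt: str,
    num_samples: int = 8,
) -> float:
    """Score the greedy response by its agreement with diverse sampled responses."""
    greedy_answer = extract_answer(generate(prompt, False))    # greedy decoding
    sampled = Counter(extract_answer(generate(prompt, True))   # temperature sampling
                      for _ in range(num_samples))
    # Fraction of sampled answers matching the greedy answer; a scalar like
    # this could serve as the RL reward for the greedy rollout.
    return sampled[greedy_answer] / num_samples
```

A reward of this shape needs no external labels, which is consistent with the paper's framing of self-tuning; how STC actually constructs its reward is only knowable from the full paper.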
Paper Type: long
Research Area: Generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.