Abstract: Large language models have demonstrated a range of abilities, e.g., Chain-of-Thought reasoning on math reasoning datasets. Can models learn to self-improve these skills? First, we statistically analyze the potential of language models' self-evaluation ability. Then, we present a novel self-tuning framework, STC, that leverages reinforcement learning to enhance the reasoning capabilities of large language models. STC encourages the generation of logical explanations by evaluating greedy-decoded responses against diverse sampled responses. Results highlight the effectiveness of our framework across various model sizes (1B-20B). We observe accuracy improvements of up to 5% on four different math reasoning datasets, while also improving commonsense ability and retaining language understanding ability. Additionally, human and machine evaluations confirm that the generated responses became more detailed and logical after training.
Paper Type: long
Research Area: Generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.