Abstract: This paper identifies that misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1.
For example, in the phrase "10 dollars per kilo," Llama3.2-3B-Instruct might not recognize that "per" means "for each," leading to calculation errors.
We introduce a novel post-training approach called **Stick to the Facts (SIFT)** to tackle this issue.
SIFT leverages increased inference-time compute to ground LLM reasoning in the context.
At the core of SIFT lies the *Sticker*, which is generated and refined by the model itself to explicitly emphasize the key information within the context.
Given the curated Sticker, SIFT generates two predictions---one from the original query and one from the query augmented with the Sticker.
If they differ, the Sticker is sequentially refined via *forward* optimization (to better align the extracted facts with the query) and *inverse* generation (to conform with the model’s inherent tendencies) for more faithful reasoning outcomes.
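To make the pipeline above concrete, here is a minimal sketch of the SIFT loop in Python. The `llm` callable, the prompt wording, and the exact-match agreement test are all illustrative assumptions, not the authors' actual prompts or API; the paper's refinement procedure may differ in its details.

```python
def sift(llm, query, max_rounds=3):
    """Sticker-grounded answering loop (a sketch, not the authors' code).

    `llm` is assumed to be any prompt-in, text-out callable.
    """
    # 1. Have the model extract the key facts (the "Sticker") from the query.
    sticker = llm("List the key facts needed to answer this query:\n" + query)

    pred_augmented = None
    for _ in range(max_rounds):
        # 2. Two predictions: one from the bare query, one from the
        #    Sticker-augmented query.
        pred_plain = llm("Answer the query:\n" + query)
        pred_augmented = llm(f"Key facts:\n{sticker}\n\nAnswer the query:\n{query}")

        # 3. Agreement between the two predictions serves as the consistency check.
        if pred_plain == pred_augmented:
            return pred_augmented

        # 4a. Forward optimization: revise the Sticker to better align the
        #     extracted facts with the query.
        sticker = llm(
            f"Query:\n{query}\n\nRevise these extracted facts so they "
            f"faithfully reflect the query:\n{sticker}"
        )
        # 4b. Inverse generation: regenerate the Sticker in the model's own
        #     words so it conforms to the model's inherent tendencies.
        sticker = llm(
            f"Query:\n{query}\n\nRestate these facts in your own words:\n{sticker}"
        )

    # If no agreement is reached within max_rounds, fall back to the
    # Sticker-augmented prediction.
    return pred_augmented
```

In this reading, the dual prediction acts as a cheap self-consistency check: refinement is triggered only when the Sticker-augmented answer diverges from the plain one, i.e., when grounding appears to have failed.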
Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements.
Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to 85.67%, establishing a new state-of-the-art in the open-source community.
The code will be made public upon acceptance.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Factual Consistency, Prompt Engineering
Languages Studied: English
Submission Number: 7997