Abstract: Large language models (LLMs) can produce overconfident and factually unsupported answers, limiting their reliability for tasks that demand faithfulness to provided evidence.
Softmax tempering, which multiplies the pre-softmax logits by a temperature $T$ at training time, was originally used for knowledge distillation and has since emerged as
a simple approach to improving both confidence calibration and factual consistency.
In this paper, we provide (1)~a structured literature review of softmax tempering in Transformer-based models, and (2)~an empirical study using \model, comparing tempered fine-tuning against standard fine-tuning on SQuAD v2 and a new dataset, PolyCompQA, which contains QA pairs based on tables from the polymer composite literature. Our experiments reveal that moderate temperatures (e.g., $T=1.67$) reduce hallucinations and improve calibration metrics, with minimal implementation overhead.
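The tempering operation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: following the abstract's phrasing, logits are multiplied by $T$ before the softmax (note that the knowledge-distillation convention instead divides by $T$), and the function name and default $T=1.67$ are taken from the example temperature above.

```python
import math

def tempered_softmax(logits, T=1.67):
    """Softmax over logits scaled by temperature T before normalization.

    Multiplying by T > 1 sharpens the output distribution; in the
    knowledge-distillation convention (logits / T), T > 1 softens it.
    """
    scaled = [z * T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At training time, the tempered logits would simply replace the raw logits inside the cross-entropy loss; at inference the standard ($T=1$) softmax can be used unchanged.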
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: softmax tempering, table QA
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Theory
Languages Studied: English
Submission Number: 8375