A Domain-Specific Post-Hoc Approach to Address the Failure of Platt Scaling in LLM Calibration

ACL ARR 2025 July Submission126 Authors

23 Jul 2025 (modified: 04 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: The reliable deployment of trustworthy AI systems hinges on precise model calibration. As LLM capabilities advance, a deeper empirical understanding of their calibration on multiple-choice questions, under diverse conditions and varying task demands, remains essential. This paper presents a comprehensive analysis of LLM calibration across multiple architectures and a spectrum of multiple-choice questions in different domains. Our systematic investigation reveals that standard calibration techniques, including the widely used temperature scaling and Platt Scaling, often show inconsistent efficacy across models and knowledge domains, underscoring the need for more adaptive calibration strategies. As part of this broad investigation, we introduce and evaluate Normalized Multiple Choice Platt Scaling (**NMPS**). This lightweight post-processing technique is highly efficient, requiring no LLM fine-tuning and adding negligible computational overhead at inference. Our experiments demonstrate that this approach offers a substantial improvement over existing methods: it reduces the mean calibration error across our test suite by nearly 12%, whereas standard Platt Scaling proves detrimental, increasing the error to 145%. This work thus provides two key contributions: an effective, non-invasive calibration method and crucial insights into domain-dependent model reliability, offering a practical roadmap for developing more trustworthy AI systems.
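For context, a minimal sketch of the two ingredients the abstract contrasts: standard Platt scaling maps a single raw score through a fitted sigmoid, while a normalized multiple-choice variant can instead apply the affine rescaling to every option logit and renormalize across options. This is an illustrative interpretation, not the paper's exact NMPS formulation (which is defined in the paper itself); the parameters `a` and `b` stand in for fitted calibration coefficients.

```python
import numpy as np

def platt(score, a=1.0, b=0.0):
    """Standard Platt scaling: map one raw score to a probability
    via a fitted sigmoid p = 1 / (1 + exp(-(a*score + b)))."""
    return 1.0 / (1.0 + np.exp(-(a * score + b)))

def scaled_mc_probs(option_logits, a=1.0, b=0.0):
    """Illustrative multiple-choice variant: apply the same affine map
    to every option logit, then renormalize with a softmax so the
    per-question option probabilities sum to 1."""
    z = a * np.asarray(option_logits, dtype=float) + b
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

The key structural difference is that the sigmoid treats each answer in isolation, whereas the softmax-renormalized variant keeps the per-question probability simplex intact, which is the natural constraint for multiple-choice answering.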
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Calibration/uncertainty, Safety and alignment, Question Answering, Robustness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Surveys
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
Software: zip
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: We do not foresee any potential risks arising from our work; rather, we aim to mitigate potential risks via improved model calibration.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Yes, the creators of the models, datasets, and tools used are cited in Section 5.1 (Experimental Setup) and Appendix A (Acknowledgement of Artifacts).
B2 Discuss The License For Artifacts: No
B2 Elaboration: No, the artifact licenses permit reproduction and distribution for academic use.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: No, the use of all artifacts is consistent with their original intended use.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: No, we do not use any data that contain personally identifying or offensive content.
B5 Documentation Of Artifacts: No
B5 Elaboration: No; our code only runs inference with others' models and datasets, so readers should refer to the original artifacts for detailed documentation.
B6 Statistics For Data: Yes
B6 Elaboration: Yes, the paper details the calibration/validation/test splits for the MMLU benchmark, designating an 80% split for the calibration set and a 20% split for validation. Other benchmarks were used as a pure test set. This is discussed in Section 5.4 (Implementation Details).
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Yes, the paper reports the number of parameters for the models used (e.g., Llama-3.2-3B, Phi4-14B, Qwen3-32B, s1.1-32B) and states that experiments were conducted on a system equipped with 2x NVIDIA A100 GPUs. This information is found in Section 5.1 (Experimental Setup) and Section 5.4 (Implementation Details). The total computational budget in GPU hours is not explicitly reported.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Yes, the experimental setup is discussed in Section 5.4 (Implementation Details). This includes details on data handling, logit extraction, and calibration method setup. For Temperature Scaling, the optimal temperature was determined via a grid search. For Platt Scaling and NMPS, parameters were optimized using the L-BFGS-B algorithm.
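The fitting procedure described above (grid search for the temperature, L-BFGS-B for the sigmoid parameters) can be sketched as follows. This is a hedged sketch under assumptions: the grid range, and the binary correct/incorrect Platt objective shown here, are illustrative choices, not the values from Section 5.4.

```python
import numpy as np
from scipy.optimize import minimize

def nll_at_temperature(logits, labels, T):
    """Mean negative log-likelihood of the correct options at temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                       # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Grid search for the temperature minimizing NLL on a calibration set
    (grid range is an illustrative assumption)."""
    return min(grid, key=lambda T: nll_at_temperature(logits, labels, T))

def fit_platt(scores, correct):
    """Fit Platt scaling p = sigmoid(a*s + b) by minimizing binary NLL
    with L-BFGS-B, as the checklist item describes."""
    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        eps = 1e-12                                            # avoid log(0)
        return -np.mean(correct * np.log(p + eps)
                        + (1 - correct) * np.log(1 - p + eps))
    res = minimize(nll, x0=[1.0, 0.0], method="L-BFGS-B")
    return res.x
```

Both fits touch only extracted logits on a held-out calibration split, which is what makes these methods post-hoc: no LLM weights are updated.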
C3 Descriptive Statistics: Yes
C3 Elaboration: Yes, the paper reports descriptive statistics including mean accuracy and mean calibration error across all models and tasks in Table 1. Table 2 provides the variance and standard deviation for these metrics. Figure 2 visually represents mean performance with error bars indicating standard deviation. This demonstrates that aggregated results are reported, not single runs.
C4 Parameters For Packages: Yes
C4 Elaboration: Yes, the paper reports the use of lm-evaluation-harness (v0.4.1), PyTorch, and HuggingFace Transformers libraries. It details the experimental setup and parameter optimization for calibration methods in Section 5.4 (Implementation Details), including the grid search range for Temperature Scaling and the L-BFGS-B algorithm for Platt Scaling and NMPS. The Qwen1.5B-Instruct model is also specified as the categorizer LLM.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D1 Elaboration: N/A, as no human participants or annotators were involved in this research.
D2 Recruitment And Payment: N/A
D2 Elaboration: This is not discussed as the research exclusively uses existing, publicly available benchmark datasets (MMLU, BigBench) and does not involve recruiting or paying human participants.
D3 Data Consent: N/A
D3 Elaboration: This is not discussed as the research exclusively uses existing, publicly available benchmark datasets (MMLU, BigBench) and does not involve collecting or curating new data from human subjects. Therefore, data consent from individuals was not applicable to this work.
D4 Ethics Review Board Approval: N/A
D4 Elaboration: This is not applicable as the research does not involve human subjects or the collection of new data from them. The study relies solely on existing, publicly available benchmark datasets.
D5 Characteristics Of Annotators: N/A
D5 Elaboration: N/A, as no human annotators or participants were involved in this research. The study relies solely on existing, publicly available benchmark datasets, and data categorization was performed by an LLM.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Yes, the use of Google Gemini for domain categorization is detailed in Section 4.2 (Adaptive Calibration Process).
Author Submission Checklist: yes
Submission Number: 126