Abstract: Users of hosted large language model (LLM) providers commonly notice that outputs can vary for the same inputs, even under settings expected to be deterministic. While exact statistics are difficult to obtain, recent reports on specialty news sites and discussion boards suggest that, across user communities, the majority of LLM usage today is through cloud-based APIs. Yet the questions of how pervasive non-determinism is, and how much it affects performance results, have to our knowledge not been systematically investigated. We apply five API-based LLMs, configured to be deterministic, to eight diverse tasks over 10 runs each. Experiments reveal accuracy variations of up to 15% across runs, with a gap of up to 70% between the best and worst possible performance. No LLM consistently delivers the same outputs or accuracies, regardless of task. We speculate about sources of non-determinism, such as input buffer packing across multiple jobs. To quantify our observations, we introduce two determinism metrics: TARr@N, the total agreement rate over raw outputs at N runs, and TARa@N, the total agreement rate over parsed-out answers. Our code and data will be publicly available at https://github.com/Anonymous.
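For concreteness, the sketch below shows one plausible way the two agreement metrics could be computed, assuming TARr@N and TARa@N are the fraction of test items on which all N runs agree exactly (over raw output strings and parsed answers, respectively). The parse_answer helper is purely hypothetical; the paper's exact definitions may differ.

```python
# Minimal sketch (not the authors' implementation): TARr@N and TARa@N read as the
# fraction of test items for which all N runs agree exactly.

def tar_at_n(runs: list[list[str]]) -> float:
    """runs[i][j] is the output of run i on test item j.
    Returns the fraction of items where all N runs produced identical strings."""
    n_items = len(runs[0])
    agree = sum(1 for j in range(n_items)
                if len({run[j] for run in runs}) == 1)
    return agree / n_items

def parse_answer(raw_output: str) -> str:
    # Hypothetical answer parser: take the last whitespace-separated token.
    return raw_output.strip().split()[-1] if raw_output.strip() else ""

def tar_r(runs):  # TARr@N over raw outputs
    return tar_at_n(runs)

def tar_a(runs):  # TARa@N over parsed-out answers
    return tar_at_n([[parse_answer(o) for o in run] for run in runs])

# Example: 3 runs over 2 items; raw outputs differ on item 0, parsed answers agree.
runs = [["The answer is 4", "Paris"],
        ["Answer: 4",       "Paris"],
        ["The answer is 4", "Paris"]]
print(tar_r(runs))  # 0.5
print(tar_a(runs))  # 1.0
```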
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation, reproducibility, determinism
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
A2 Elaboration: We do not believe our work poses any risks.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 3
B2 Discuss The License For Artifacts: No
B2 Elaboration: They are publicly available and frequently used datasets.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We only used them for research purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Our code will be made available.
B6 Statistics For Data: Yes
B6 Elaboration: 3
C Computational Experiments: Yes
C1 Model Size And Budget: N/A
C1 Elaboration: We used APIs to run the experiments, so we do not know the infrastructure the providers use.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 5
C3 Descriptive Statistics: Yes
C3 Elaboration: 6
C4 Parameters For Packages: Yes
C4 Elaboration: f
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 375