Artifact Readiness Gates with Saturation Stop Rules and Host-Parity Admissibility for FM Release Evaluation

Published: 28 Mar 2026 · Last Modified: 01 Apr 2026 · AIware 2026 · CC BY 4.0
Keywords: AI-powered software, release engineering, trustworthiness, evaluation gates, host parity, saturation stop rules, InvarLock
TL;DR: We present a three-gate FM release-evaluation framework showing that, in this H100/H200 matrix, seed repetition saturated after one pass, edit-family breadth remained informative, and host-parity checks blocked boundary-risk promotions.
Abstract: Release evaluation for FM-powered software often grows by habit rather than policy: teams repeat runs until budget or time is exhausted, without clear evidence that more passes change release decisions. We study a release-evaluation protocol that separates three concerns: artifact readiness, decision-stability stopping, and cross-hardware promotion gating. The study uses 340 runs spanning seven edit families (five core plus two probes), four model families, ten seeds, and dual-host H100/H200 execution. In this matrix and under this policy setting, additional seed repetition did not change promote/block outcomes, edit-family breadth remained decision-informative, and small H100/H200 score differences could still alter promotion outcomes near strict boundaries. These findings motivate workload-conditional resource allocation for release engineering: in this evidence setting, additional budget is more decision-informative when spent on edit diversity and host-parity checks than on deeper seed repetition. The contribution is an operational decision framework, with explicit sensitivity reporting, that turns release evaluation from a fixed checklist into a defensible governance process. In this matrix, seed-stop reduced measured GPU-hours by about 90% versus fixed 10-pass seed evaluation. Numeric thresholds are workload-derived; the transferable contribution is the gate-setting process.
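To make the gate composition described in the abstract concrete, the sketch below is a minimal, hypothetical Python illustration of how the three concerns (artifact readiness, decision-stability seed stopping, and host-parity admissibility) could be wired into a single promote/block decision. All names, fields, and tolerance values here are placeholders for illustration; they are not the paper's InvarLock implementation or its workload-derived thresholds.

```python
from dataclasses import dataclass

# Hypothetical sketch of the three-gate release decision; field names,
# thresholds, and tolerances are illustrative placeholders only.

@dataclass
class RunEvidence:
    artifact_checks_passed: bool   # Gate 1: artifact readiness
    scores_by_seed: list[float]    # per-seed scores for one edit family
    score_h100: float              # matched-run score on the H100 host
    score_h200: float              # matched-run score on the H200 host


def seed_stop_saturated(scores: list[float], tol: float = 0.01) -> bool:
    """Decision-stability stop: skip further seed passes once additional
    seeds stop moving the result (approximated here as staying within
    `tol` of the first pass)."""
    return all(abs(s - scores[0]) <= tol for s in scores[1:])


def host_parity_admissible(score_a: float, score_b: float, threshold: float) -> bool:
    """Host-parity gate: admissible only if both hosts land on the same
    side of the release threshold, so a small cross-host delta cannot
    flip the promotion decision near a strict boundary."""
    return (score_a >= threshold) == (score_b >= threshold)


def promote(run: RunEvidence, threshold: float) -> bool:
    """Combine the three gates into a single promote/block outcome."""
    if not run.artifact_checks_passed:          # Gate 1: artifact readiness
        return False
    if not host_parity_admissible(run.score_h100, run.score_h200, threshold):
        return False                            # Gate 3: host-parity admissibility
    return min(run.scores_by_seed) >= threshold  # score gate on evaluated seeds
```

In this reading, `seed_stop_saturated` would govern how many seed passes are run (the saturation stop rule), while `promote` consumes whatever evidence was collected; the actual gate-setting process and numeric thresholds are defined per workload in the paper.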
Revision Summary: We revised the manuscript to address the reviews while preserving the paper’s core results and conclusions. The main changes are clarifications of scope, claim framing, and presentation. We now state explicitly that the contribution is a governance-level, auditable release-evaluation protocol rather than a claim about downstream task improvement. We narrowed the strongest empirical claims by scoping them to “this matrix / this policy setting,” recast the budget-allocation guidance as workload-conditional, and narrowed the tool-agnostic statement to interface-level portability in principle without cross-runner validation. We also make the H100/H200 scope of the host-parity evidence explicit earlier in the Results and Discussion. For clarity and readability, we rewrote the contribution bullets, added supporting references in the protocol section, added a terminology box and a clearer protocol-flow figure, clarified the evaluation coverage as five core edit families plus two probe families, and defined the edit families more explicitly in the evaluation design/results text. We expanded the discussion to better separate empirical findings from deployment guidance, softened the novelty claim relative to existing LLMOps practice, consolidated the host-delta presentation into a single table, and moved the order-sensitivity summary to the appendix. We also added more explicit interpretation around the synthesis table and strengthened the limitations/future-work discussion, especially around stochastic coverage, single-toolchain dependence, and broader cross-device validation.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length papers (i.e. case studies, theoretical, applied research papers). 8 pages
Reroute: false
Submission Number: 39