Trace Before Trust: Content Provenance, Model Integrity, and Service Attestation for Accountable Open‑Weight LLMs
Keywords: Content Provenance, Model Integrity, Service Attestation, Accountable Open‑Weight LLMs
TL;DR: We explain the key criteria and metrics for evaluating open-weight models so that how transparent and safe a model is, and what service actually runs it, can be assessed and audited.
Abstract: Open-weight large language models (LLMs) enable broad access and rapid innovation, but they also make it easy to derive models (via distillation, fine-tuning, compression, or editing) that remove safety protections, obscure provenance, or violate licenses. We advocate a coordinated accountability posture: keep models open, and make misuse measurable at the point of use. Concretely, services should provide watermarking for long-form, first-party artifacts, publish update-durable safety reports for released checkpoints, and supply identity proofs of the model they run (black-box challenge prompts and rotating fingerprints). This reframes accountability around three non-overlapping layers: Content Provenance (CP) for document-side checks, Model Integrity (MI) for behavior under bounded updates, and Service Attestation (SA) for identifying the root model a service actually runs. We ground the framework in recent evidence on production-scale watermarking, safety erosion under small fine-tunes, tamper-resistant training, and reliable black-box attribution. We also specify a threat-budgeted evaluation protocol that tests distillation plus paraphrasing, quantization/pruning, parameter-efficient fine-tuning, and targeted edits, with power-controlled metrics. The goal is not lock-down but auditable accountability: open access paired with practices that report the limitations of each of these three layers.
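The Service Attestation layer mentioned in the abstract can be pictured as a black-box fingerprint check against a set of challenge prompts. The sketch below is a minimal illustration under stated assumptions, not the paper's protocol: it assumes deterministic (greedy) decoding so that exact-match hashing is meaningful, and all names (`CHALLENGE_PROMPTS`, `query_model`, the fingerprint registry) are hypothetical stand-ins. A real deployment would rotate the challenge pool per epoch and use statistical tests rather than exact-match hashes.

```python
# Hedged sketch of black-box service attestation via challenge prompts.
# All identifiers here are hypothetical stand-ins for whatever interface
# a deployed service actually exposes.
import hashlib
from typing import Callable, Dict, List, Optional

# Rotating challenge set: in practice drawn from a larger pool and
# refreshed regularly to resist caching and replay.
CHALLENGE_PROMPTS: List[str] = [
    "Complete: the quick brown fox",
    "Translate 'good morning' to French.",
    "List three prime numbers greater than 100.",
]

def response_fingerprint(query_model: Callable[[str], str],
                         prompts: List[str]) -> str:
    """Hash the service's responses to the challenge prompts into one digest."""
    h = hashlib.sha256()
    for prompt in prompts:
        h.update(prompt.encode("utf-8"))
        h.update(query_model(prompt).encode("utf-8"))
    return h.hexdigest()

def attest(query_model: Callable[[str], str],
           registry: Dict[str, str]) -> Optional[str]:
    """Return the registered root-model ID whose published fingerprint matches
    the served endpoint, or None if nothing matches (possible undisclosed
    fine-tune, quantized derivative, or substituted model)."""
    observed = response_fingerprint(query_model, CHALLENGE_PROMPTS)
    for model_id, expected in registry.items():
        if observed == expected:
            return model_id
    return None

if __name__ == "__main__":
    # Toy stand-in for a served model: deterministic echo of the prompt.
    stub = lambda prompt: prompt.upper()
    registry = {"example-root-model": response_fingerprint(stub, CHALLENGE_PROMPTS)}
    print(attest(stub, registry))  # -> "example-root-model"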
Submission Number: 76