Governance for LLMOps: Data–Prompt–Model–Release Chain


Voruganti Kiran Kumar

5/31/2025 · 10 min read


Abstract— We extend governance from code to LLM artifacts: datasets, prompts, adapters, and serving graphs. Policies require dataset lineage and approvals, prompt evaluation baselines, red-team attestations for safety, and SLSA-style provenance for fine-tunes and adapters. Release gates ensure that only evaluated and signed LLM assets reach production, with rollback tied to prompt/model versions implicated in incidents. We present an artifact taxonomy, a policy pack, enforcement surfaces, and evaluation metrics that align LLM development with enterprise DevOps and AI-governance expectations. The design is Kubernetes- and GitOps-friendly, evidence-first (“no evidence, no exposure”), and compatible with established secure software development and reliability practices.

Index Terms— LLMOps, provenance, safety evaluation, prompt governance, SBOM for AI, CI/CD, GitOps, release engineering, red-teaming, admission control.


I. Introduction

Large language model (LLM)–enabled systems change through four concurrent channels: (i) data (new corpora, redactions, augmentations), (ii) prompts and chains (templates, routing graphs, tools), (iii) models and adapters (fine-tunes, LoRA weights, tokenizer revisions), and (iv) serving graphs (orchestration of components under traffic rules). Traditional software pipelines treat code as the primary governed artifact; as a result, lineage is unclear for datasets and prompts, evaluations are inconsistent, and rollbacks are brittle when incidents implicate a specific prompt or adapter.

This paper specifies a unified governance approach for the data–prompt–model–release chain. We define an artifact taxonomy and a policy pack with machine-checkable obligations and attestable evidence, describe enforcement surfaces across CI, cluster admission, and GitOps reconciliation, and propose metrics and evaluation protocols suitable for engineering and audit. The aim is not to slow iteration but to shift evidence left and bind decisions to signed proofs, so changes ship quickly when safe and predictably roll back when not.

Our contributions are:

  1. A concrete taxonomy of LLM artifacts with governance-relevant metadata;

  2. A policy pack mapping risk-based obligations (lineage, evaluations, safety attestations, provenance) to machine-checkable gates;

  3. An evidence model for datasets, prompts, and model weights, extending software SBOM and provenance concepts to LLM assets;

  4. An enforcement architecture (CI gates, admission policies, GitOps) and a runtime safety layer;

  5. Metrics, baselines, and threats-to-validity for rigorous evaluation; and

  6. A realistic case study illustrating end-to-end behavior.

II. Background

AI governance. The NIST AI Risk Management Framework (AI RMF 1.0) articulates characteristics of trustworthy AI—valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair. ISO/IEC 42001:2023 specifies an AI management system (AIMS) with roles and lifecycle controls. These emphasize documented lineage, evaluation, and accountability, which we operationalize as release-time evidence and admission policies.

Supply-chain assurance. SLSA v1.0 defines provenance levels for build integrity and traceability; in-toto and modern signing systems provide attestation and verification primitives. We adapt these to LLM assets: datasets, prompts, and weights become first-class, signable artifacts with provenance.

SRE and release practice. Canarying, error budgets, rollback discipline, and objective gates are standard in reliable service delivery. We extend these to prompt and model changes, ensuring the same rigor applied to code is applied to non-code artifacts that can materially change system behavior.

III. Threat Model and Risk Taxonomy

We classify risks by artifact:

  • Datasets: license non-compliance; privacy violations (PII leakage, insufficient consent); data poisoning; stale or mis-split eval sets; undocumented transformations.

  • Prompts and chains: prompt injection/jailbreaks; tool-misuse paths; ambiguous guardrails; untested jurisdiction-specific instructions; brittle templates that regress under tokenization or localization.

  • Models and adapters: tampered weights; drift or degradation under domain shift; unsafe fine-tuning; misaligned tokenizer; incompatibility with safety filters; unknown base model lineage.

  • Serving graphs: misrouted traffic across regions/jurisdictions; unsafe tool access; weak rate limits; inconsistent rollback for composite updates (prompt+adapter+retriever).

  • Supply chain: unsigned artifacts; missing attestations; SBOM gaps for model packages; weak or absent provenance; non-approved signers.

Governance must prevent unauthorized exposure, detect drift, and explain decisions with recourse (minimal changes to flip a denial) and a governed override for emergencies.

IV. Artifact Taxonomy and Policy Pack

A. Artifacts (governed units)

  1. Datasets—training, tuning, validation, and evaluation sets with:

    • Sources and collection methods; licenses and consent status;

    • Privacy screening and PII residual-risk rating;

    • Splits and transformation pipelines, each versioned;

    • Hashes for files/shards and an immutable manifest.

  2. Prompts and chains—system/user prompts, templates, chain/routing specs, tool call policies:

    • Parameterized templates with explicit placeholders;

    • Supported locales and jurisdiction tags;

    • Safety and evaluation references (baseline suites, jailbreak reports).

  3. Models and adapters—base model identifiers, fine-tuned checkpoints, LoRA/adapter weights:

    • Base model ID and license; tokenizer/version;

    • Training code and hyperparameters snapshot;

    • Evaluation summaries and safety attestations.

  4. Serving graphs—composition spec binding models, prompts, retrievers/indices, tools, filters, and traffic rules:

    • Versioned YAML/JSON referencing exact hashes;

    • Cohort/jurisdiction routing and rate limits;

    • Guardrail configuration (input/output policies).
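
To make the taxonomy concrete, the following is a minimal sketch of the four governed artifact classes as hash-addressed, versioned records. The field names (e.g., base_model_id, pii_residual_risk, component_hashes) are illustrative assumptions that mirror the metadata listed above, not a fixed schema.

```python
# Illustrative (not normative) records for the four governed artifact classes.
# Field names are assumptions chosen to mirror the taxonomy in Section IV-A.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DatasetArtifact:
    name: str
    version: str
    shard_hashes: Dict[str, str]        # file/shard path -> SHA-256
    sources: List[str]
    licenses: List[str]
    pii_residual_risk: str              # e.g., "low", "medium", "high"
    transformation_pipeline: List[str]  # versioned transform step IDs

@dataclass
class PromptArtifact:
    name: str
    version: str
    template_hash: str
    placeholders: List[str]
    locales: List[str]
    jurisdictions: List[str]
    evaluation_refs: List[str]          # hashes of eval / jailbreak reports

@dataclass
class ModelArtifact:
    name: str
    version: str
    base_model_id: str
    base_license: str
    tokenizer_version: str
    checkpoint_hash: str
    training_code_ref: str              # commit or snapshot hash
    evaluation_refs: List[str]

@dataclass
class ServingGraph:
    name: str
    version: str
    component_hashes: Dict[str, str]    # role (model/prompt/retriever/tool) -> hash
    routing: Dict[str, str]             # cohort/jurisdiction -> traffic rule
    rate_limits: Dict[str, int]
    guardrails: List[str]               # input/output policy IDs
```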

B. Policy Pack (obligations and gates)

Dataset lineage and approval.
Obligations: datasheet; license and consent audit; privacy/PII screening; reviewer sign-off; differential privacy or redaction when required.
Gate: deny promotion if lineage or approvals are missing; require an updated risk assessment when the source distribution changes materially.

Prompt evaluation.
Obligations: baseline quality metrics; robustness under paraphrase/locale; jailbreak/red-team results and mitigations; approval of a risk envelope (what the prompt may plausibly fail at).
Gate: deny if evaluation scores fall below thresholds or a new high-severity jailbreak lacks mitigation; require jurisdiction-specific checks when routing includes that region.

Model/adapters evaluation and provenance.
Obligations: task metrics and robustness; bias/safety evaluation appropriate to domain; SLSA-style provenance linking checkpoint → data/code/infrastructure; signatures from approved identities.
Gate: deny if signatures are invalid, attestations are absent, the SBOM is incomplete, or provenance is below the configured level for the target environment.

Serving safety.
Obligations: guardrails for input/output; tool-policy checks; rate limits; cohort/jurisdiction routing; incident hooks.
Gate: deny if serving graph references unapproved resources or lacks required filters in protected namespaces.

Promotion rule.
Promotion requires presence and validity of all applicable evidence; missing or stale evidence defaults to deny with recourse. Emergency overrides are allowed under controlled break-glass with expiry and post-mortems.
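
A sketch of the promotion rule as a pure function over the evidence bundle follows. The obligation names and rule IDs are placeholder assumptions; in practice they would be derived from the policy pack rather than hard-coded, and the break-glass path is not shown.

```python
# Hypothetical promotion gate: default-deny with machine-readable recourse.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class GateDecision:
    allow: bool
    reasons: List[str]   # rule IDs that failed (empty if allowed)
    recourse: List[str]  # minimal edits suggested to flip a denial

# Obligations per artifact class (illustrative subset of the policy pack).
REQUIRED_EVIDENCE = {
    "dataset": ["datasheet", "license_audit", "privacy_screen", "steward_signoff"],
    "prompt": ["quality_eval", "robustness_eval", "jailbreak_report", "risk_envelope"],
    "model": ["task_eval", "safety_eval", "provenance", "signature"],
    "serving_graph": ["guardrail_config", "tool_policy", "rate_limits"],
}

def promotion_gate(evidence: Dict[str, Dict[str, bool]]) -> GateDecision:
    """evidence maps artifact class -> {obligation: present_and_valid}."""
    reasons, recourse = [], []
    for artifact, obligations in REQUIRED_EVIDENCE.items():
        supplied = evidence.get(artifact, {})
        for ob in obligations:
            if not supplied.get(ob, False):  # missing or stale evidence => deny
                reasons.append(f"{artifact}.{ob}.missing")
                recourse.append(f"attach valid {ob} for {artifact}")
    return GateDecision(allow=not reasons, reasons=reasons, recourse=recourse)
```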

V. Evidence Model and Attestations

We extend software SBOM/provenance to AI artifacts:

  • Dataset Manifest (AI-SBOM-D): file hashes, sources, licenses, consent and privacy status, transformation graph, split hashing, and sign-offs.

  • Prompt Manifest (AI-SBOM-P): template text hash, parameters, locale coverage, evaluation bundle hashes (quality, robustness, jailbreak), mitigations, and approval.

  • Model/Adapter Manifest (AI-SBOM-M): base model ID, tokenizer/hash, training code and hyperparameters snapshot, evaluation bundle hashes, and safety attestations.

  • Serving Graph Manifest (AI-SBOM-S): component references (by hash), guardrails configuration, routing rules, rate limits, and jurisdiction tags.

Each manifest is signed; in-toto-style attestations link artifacts to build/train steps and evaluators. Provenance levels (analogous to SLSA) set environment thresholds (e.g., staging requires L2, production requires L3+). The evidence bundle accompanies every release request.
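
The flow from manifest to signed attestation could look like the sketch below: the manifest is canonicalized and hashed, then wrapped in an in-toto-style statement. The statement layout is simplified for illustration, the predicateType strings are invented, and sign_bytes is a stand-in for a real signer (e.g., a KMS- or Sigstore-backed identity), not an actual API.

```python
# Simplified evidence flow: canonical manifest -> digest -> attestation statement.
# Loosely follows the in-toto attestation idea; not a conformant implementation.
import hashlib
import json
from typing import Any, Dict

def canonical_digest(manifest: Dict[str, Any]) -> str:
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def sign_bytes(payload: bytes) -> str:
    # Placeholder signature; a real system would invoke its signing service here.
    return "SIG(" + hashlib.sha256(payload).hexdigest()[:16] + ")"

def make_attestation(manifest: Dict[str, Any], predicate_type: str, signer_id: str) -> Dict[str, Any]:
    statement = {
        "subject": [{"name": manifest["name"], "digest": {"sha256": canonical_digest(manifest)}}],
        "predicateType": predicate_type,  # e.g., "ai-sbom/prompt" (illustrative name)
        "predicate": {"manifest": manifest, "signer": signer_id},
    }
    payload = json.dumps(statement, sort_keys=True).encode()
    return {"statement": statement, "signature": sign_bytes(payload)}

# Example: attest a prompt manifest (AI-SBOM-P) before it joins the evidence bundle.
prompt_manifest = {"name": "refund-policy-prompt", "version": "3.2.0",
                   "template_hash": "abc123", "locales": ["en-US"], "eval_refs": ["eval-777"]}
bundle_entry = make_attestation(prompt_manifest, "ai-sbom/prompt", "prompt-owner@example.com")
```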

VI. Enforcement Surfaces

A. CI Gates (pre-merge and build)

  • Datasets: verify datasets referenced in training/tuning have approved manifests; run privacy scans and license checks; publish AI-SBOM-D and sign.

  • Prompts: on prompt change, run evaluation pipelines (quality, robustness, jailbreak); sign evaluation report; update AI-SBOM-P.

  • Models/adapters: generate AI-SBOM-M and provenance; sign artifacts and attestations; enforce minimum evaluation thresholds; block if missing or failed.

  • Serving graphs: validate schema and references; ensure all component manifests are signed and within recency windows.

Failures block artifact publication. Recourse suggestions (e.g., “add eval for locale X,” “attach missing provenance”) are produced automatically.
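
One way the CI gate could be wired is a single runner that aggregates the per-artifact checks and fails the build with recourse hints. The check functions below are stubs standing in for the real scanners, evaluators, and signature verifiers.

```python
# Hypothetical CI gate runner: each check returns (ok, recourse_hint).
import sys
from typing import Callable, List, Tuple

Check = Callable[[], Tuple[bool, str]]

def check_dataset_manifest() -> Tuple[bool, str]:
    return True, ""   # stub: verify AI-SBOM-D exists, is signed, and passes privacy/license scans

def check_prompt_evals() -> Tuple[bool, str]:
    return False, "run jailbreak suite and attach signed report to AI-SBOM-P"

def check_model_provenance() -> Tuple[bool, str]:
    return True, ""   # stub: verify signatures, attestations, and provenance tier

def check_serving_graph_schema() -> Tuple[bool, str]:
    return True, ""   # stub: resolve all component hashes and check recency windows

def run_gates(checks: List[Check]) -> int:
    recourse = []
    for check in checks:
        ok, hint = check()
        if not ok:
            recourse.append(hint)
    if recourse:
        print("CI gate DENIED. Recourse:")
        for hint in recourse:
            print(f"  - {hint}")
        return 1      # non-zero exit blocks artifact publication
    print("CI gate passed; publishing signed manifests.")
    return 0

if __name__ == "__main__":
    sys.exit(run_gates([check_dataset_manifest, check_prompt_evals,
                        check_model_provenance, check_serving_graph_schema]))
```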

B. Admission Control (cluster-level)

  • CEL/ValidatingAdmissionPolicy: low-latency denies if required labels/annotations absent, if resources target protected namespaces without mandated guardrails, or if the referenced model/prompt hash is not on the approved list for the environment.

  • OPA/Gatekeeper: richer Rego constraints cross-check AI-SBOMs and attestations (approved signer list, minimum provenance tier), ensure that serving graphs reference only approved indices/tools, and that jurisdiction routing complies with policy.

Deny messages carry machine-readable reasons (rule IDs), enabling automated recourse.
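
The admission logic itself would be expressed in CEL and Rego; the Python sketch below only mirrors the same checks (approved hash list, minimum provenance tier, guardrails in protected namespaces) so the rule structure is easier to read. The rule IDs and threshold values are invented for illustration.

```python
# Python rendering of the admission checks described above; the real enforcement
# lives in a ValidatingAdmissionPolicy (CEL) or Gatekeeper constraint (Rego).
from typing import Dict, List

APPROVED_HASHES = {"prod": {"model": {"sha256:aaa"}, "prompt": {"sha256:bbb"}}}
MIN_PROVENANCE_TIER = {"prod": 3, "staging": 2}
PROTECTED_NAMESPACES = {"refunds", "payments"}

def admit(resource: Dict) -> List[str]:
    """Return machine-readable denial reasons (empty list = admit)."""
    env = resource["environment"]
    denials = []
    if resource["model_hash"] not in APPROVED_HASHES.get(env, {}).get("model", set()):
        denials.append("ADM-001: model hash not on approved list for environment")
    if resource["prompt_hash"] not in APPROVED_HASHES.get(env, {}).get("prompt", set()):
        denials.append("ADM-002: prompt hash not on approved list for environment")
    if resource["provenance_tier"] < MIN_PROVENANCE_TIER.get(env, 1):
        denials.append("ADM-003: provenance tier below environment minimum")
    if resource["namespace"] in PROTECTED_NAMESPACES and not resource.get("guardrails"):
        denials.append("ADM-004: protected namespace requires guardrail configuration")
    return denials
```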

C. GitOps Reconciliation (environment-level)

Controllers reconcile serving graphs only when evidence checks pass. Out-of-band changes (e.g., manual weight swap) are flagged and rolled back. Each reconciliation attaches decision traces (policies evaluated, evidence digests), enabling later audit replay.
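
A sketch of the reconciliation step's drift handling: the desired serving graph from Git is compared to the live state, out-of-band changes are reverted, and a decision trace is appended for audit replay. The function and field names are assumptions, not a specific controller's API.

```python
# Illustrative reconciliation step: evidence check, drift detection, decision trace.
import json
import time
from typing import Dict, List

def digests(graph: Dict) -> Dict[str, str]:
    return graph.get("component_hashes", {})

def reconcile(desired: Dict, live: Dict, evidence_ok: bool, trace_log: List[Dict]) -> Dict:
    decision = {"time": time.time(), "desired": digests(desired), "live": digests(live)}
    if not evidence_ok:
        decision["action"] = "hold"               # evidence missing/stale: do not converge
    elif digests(live) != digests(desired):
        decision["action"] = "revert_to_desired"  # out-of-band change (e.g., manual weight swap)
        live = json.loads(json.dumps(desired))    # converge live state back to what Git declares
    else:
        decision["action"] = "noop"
    trace_log.append(decision)                    # decision trace enables later audit replay
    return live
```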

D. Runtime Safety Layer

  • Input guardrails: PII scrubbing, prompt-injection heuristics, jurisdiction filters.

  • Output guardrails: toxicity/bias/off-policy filters, tool-access mediation.

  • Traffic policy: cohort/jurisdiction gating, rate limiting, and canarying for prompts and models.

  • Telemetry: structured logs linking every response to prompt/model/graph versions and cohort; incident hooks capture the implicated versions for deterministic rollback.
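
The request path could chain these layers as sketched below: input guardrails, the model call, output guardrails, and telemetry tagged with the prompt/model/graph versions. The filter functions are placeholders for real scrubbers and classifiers, and the version labels are illustrative.

```python
# Hypothetical request path: input guardrails -> model -> output guardrails -> telemetry.
import logging
from typing import Callable, Dict

logging.basicConfig(level=logging.INFO)

def scrub_pii(text: str) -> str:
    return text          # placeholder for a real PII scrubber

def injection_score(text: str) -> float:
    return 0.0           # placeholder prompt-injection heuristic

def toxicity_ok(text: str) -> bool:
    return True          # placeholder output filter

def handle(request: str, model: Callable[[str], str], versions: Dict[str, str]) -> str:
    cleaned = scrub_pii(request)
    if injection_score(cleaned) > 0.8:
        logging.info("blocked by input policy; versions=%s", versions)
        return "Request blocked by input policy."
    response = model(cleaned)
    if not toxicity_ok(response):
        response = "Response withheld by output policy."
    # Telemetry links every response to prompt/model/graph versions for deterministic rollback.
    logging.info("served; versions=%s", versions)
    return response

# Example: versions recorded alongside each response.
print(handle("How do refunds work?", lambda p: "Refunds take 5-7 days.",
             {"prompt": "v3.2.0", "model": "adapter-v7", "graph": "g-42"}))
```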

VII. Algorithms and Checking Methods

A. Evidence Completeness and Freshness

Define a completeness score E over the required fields for each artifact class, and define freshness windows (e.g., prompt evaluation no more than 90 days old for production). Gates enforce E = 1.0 and recency by environment tier (stricter in regulated namespaces).
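
A sketch of this check under the stated rule (E must equal 1.0 and evidence must lie inside its window) follows; the required-field list is an illustrative assumption, and the 90-day window is taken from the example above.

```python
# Completeness score E = fraction of required fields present; the gate requires
# E == 1.0 plus evidence recency inside the environment's freshness window.
from datetime import datetime, timedelta, timezone
from typing import Dict, List

def completeness(required: List[str], manifest: Dict) -> float:
    present = sum(1 for f in required if manifest.get(f) not in (None, "", []))
    return present / len(required) if required else 1.0

def fresh(evidence_time: datetime, window_days: int) -> bool:
    return datetime.now(timezone.utc) - evidence_time <= timedelta(days=window_days)

required_prompt_fields = ["template_hash", "locales", "quality_eval", "jailbreak_report"]
manifest = {"template_hash": "abc", "locales": ["en-US"],
            "quality_eval": "eval-123", "jailbreak_report": "rt-9",
            "evaluated_at": datetime(2025, 5, 1, tzinfo=timezone.utc)}

E = completeness(required_prompt_fields, manifest)
gate_pass = (E == 1.0) and fresh(manifest["evaluated_at"], window_days=90)  # prod: eval <= 90 days old
```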

B. Prompt Evaluation Pipeline

  • Quality: task-specific metrics (accuracy, BLEU/ROUGE or domain proxies), per-locale checks.

  • Robustness: paraphrase, noise, and adversarial instruction tests; evaluate stability across tokenization changes.

  • Jailbreak/red-team: curated suites with graded severities; mitigation mapping (prompt changes, filters, tool restrictions).

  • Explainability: capture few-shot rationales or traces for error analysis.

Evaluation outputs a signed report with pass/fail vs. thresholds and a risk envelope (documented failure modes).
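
The pipeline's output contract could be as simple as the sketch below: per-dimension scores compared to thresholds, plus a risk envelope, serialized into a report that is then signed and attached to AI-SBOM-P. The dimension names and threshold values are illustrative.

```python
# Illustrative prompt-evaluation report: scores vs. thresholds plus a risk envelope.
from dataclasses import asdict, dataclass
from typing import Dict, List
import json

@dataclass
class PromptEvalReport:
    prompt_hash: str
    scores: Dict[str, float]     # e.g., quality, robustness, jailbreak_resistance
    thresholds: Dict[str, float]
    risk_envelope: List[str]     # documented failure modes accepted by reviewers

    def passed(self) -> bool:
        return all(self.scores.get(k, 0.0) >= v for k, v in self.thresholds.items())

report = PromptEvalReport(
    prompt_hash="sha256:abc123",
    scores={"quality": 0.91, "robustness": 0.88, "jailbreak_resistance": 0.97},
    thresholds={"quality": 0.85, "robustness": 0.80, "jailbreak_resistance": 0.95},
    risk_envelope=["may refuse ambiguous multi-currency refund questions"],
)
payload = json.dumps({**asdict(report), "passed": report.passed()}, sort_keys=True)
# payload would then be signed and referenced from AI-SBOM-P.
```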

C. Model/Adapter Evaluation

  • Task metrics on appropriate datasets; robustness (domain and distribution shift); bias and safety checks relevant to application; latency and cost footprints.

  • Compatibility checks with guardrails and tokenizer; ablation where feasible to isolate adapter effects.

D. Serving Graph Validation

Static validation ensures that every reference resolves to an approved hash and that guardrails are configured for protected cohorts. Scenario tests (pre-prod shadow) verify off-policy behavior barriers.

E. Governance Logic

Promotion rule: allow only if all artifacts referenced by the serving graph pass evidence and evaluation checks, admission constraints are met, and canary SLIs remain within envelopes. Rollback rule: if an incident implicates a (prompt, model) version pair, revert the serving graph atomically to the last known-good pair and open remediation.
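
A sketch of the atomic rollback rule: the incident record names the implicated (prompt, model) pair, and the serving graph is reverted in one step to the last pair that passed all gates. The promotion-ledger structure is an assumption for illustration.

```python
# Hypothetical rollback: revert the serving graph atomically to the last known-good
# (prompt, model) pair recorded in a promotion ledger.
from typing import Dict, List, Optional, Tuple

PromotionRecord = Dict[str, str]  # {"prompt": hash, "model": hash, "graph": hash, "status": ...}

def last_known_good(ledger: List[PromotionRecord]) -> Optional[PromotionRecord]:
    for record in reversed(ledger):
        if record["status"] == "passed_all_gates":
            return record
    return None

def rollback(ledger: List[PromotionRecord], implicated: Tuple[str, str]) -> Optional[PromotionRecord]:
    """implicated = (prompt_hash, model_hash) named by the incident."""
    candidates = [r for r in ledger if (r["prompt"], r["model"]) != implicated]
    target = last_known_good(candidates)
    if target:
        # One atomic serving-graph swap; remediation is opened separately.
        print(f"Reverting serving graph to {target['graph']}")
    return target
```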

VIII. Governance Workflows

A. Change Proposal and Risk Tiers

Every change is assigned a risk tier (e.g., Tier 0: documentation; Tier 1: low-risk prompt wording; Tier 2: prompt logic or adapter change; Tier 3: new dataset or model). Higher tiers require more evidence, dual review, and longer canary dwell.
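
Risk tiers could be encoded as data so that required evidence, reviewer counts, and canary dwell times are looked up rather than hard-coded into pipelines; the specific values below are illustrative assumptions.

```python
# Illustrative risk-tier table: higher tiers demand more evidence, reviewers, and dwell.
RISK_TIERS = {
    0: {"label": "documentation", "evidence": [], "reviewers": 1, "canary_hours": 0},
    1: {"label": "low-risk prompt wording", "evidence": ["quality_eval"],
        "reviewers": 1, "canary_hours": 4},
    2: {"label": "prompt logic or adapter change",
        "evidence": ["quality_eval", "robustness_eval", "jailbreak_report", "provenance"],
        "reviewers": 2, "canary_hours": 24},
    3: {"label": "new dataset or model",
        "evidence": ["datasheet", "license_audit", "privacy_screen",
                     "safety_eval", "provenance", "signature"],
        "reviewers": 2, "canary_hours": 72},
}

def requirements(tier: int) -> dict:
    return RISK_TIERS[max(0, min(tier, 3))]
```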

B. Approvals and Separation of Duties

  • Data steward: dataset approvals and privacy checks.

  • Model owner: adapter and model evaluations.

  • Safety officer: red-team attestations and mitigation sign-off.

  • Platform owner: serving graph and guardrails.

Approvals are recorded as signed attestations; no single role can approve all gates for Tier 3.

C. Recourse and Break-Glass

Denials include binding reasons and minimal edit plans (e.g., add SBOM, run locale L evaluation, attach missing signature, add output filter). Break-glass permits temporary exposure with TTL tokens, two-person rule, automatic rollback window, and mandatory post-mortem.

IX. Metrics and Evaluation Protocol

A. Key Performance Indicators

  • Coverage of lineage: fraction of deployed models/prompts with complete dataset lineage and approvals.

  • Evaluation sufficiency: pass rates against mandatory suites; deltas vs. baseline after change.

  • Safety performance: pre-prod vs. post-prod red-team/jailbreak finding rates; time-to-mitigation.

  • Release reliability: time-to-rollback for unsafe prompts/models; change-failure rate vs. code-only baselines.

  • Auditability: percentage of deployments with reproducible decisions from stored attestations; audit cycle time.

  • Operator burden: added CI time, reviewer effort, and denial-to-compliance time.

  • Drift and exceptions: out-of-band change rate; break-glass invocation rate and misuse rate.

B. Baselines

(a) Code-only governance (no LLM artifact gates); (b) model-only gating (provenance but no prompt/dataset controls); (c) evaluation without enforcement (reports not bound to promotion). The target is to reduce incidents and false exposures while keeping lead time acceptable for low-risk changes.

C. Scenarios

  • Dataset change: new corpus added; verify lineage, privacy, and re-evaluation.

  • Prompt change: updated refund policy prompt; evaluation finds jailbreak; mitigation required.

  • Adapter update: new LoRA weights with improved accuracy; must meet provenance tier and safety checks.

  • Serving graph edit: add tool with risky permissions; require policy and guardrail updates.

  • Jurisdiction expansion: enable feature in a new region; run locale-specific evaluations and compliance checks.

D. Threats to Validity and Mitigations

  • Evaluation drift: benchmarks stale or unrepresentative. Mitigation: periodic refresh, canary shadow tests, incident-driven suite updates.

  • Partial observability: proprietary base models limit transparency. Mitigation: require documented base IDs and licenses; emphasize adapter provenance and strong serving guardrails.

  • Over-fitted evals: passing a suite without real-world robustness. Mitigation: adversarial testing, red-team diversity, and post-incident learning loops.

  • Latency in gates: heavy checks slowing pipelines. Mitigation: separate low-latency admission (CEL) from richer audits (Rego/CI) with staged enforcement.

X. Case Study (Illustrative)

A team updates a customer-support assistant's refund-policy prompt and fine-tuned adapter. The dataset manifest (AI-SBOM-D) shows synthetic augmentation with PII scrubbing, license review, and steward approval. Prompt evaluation reports acceptable quality and robustness but exposes a jailbreak that enables tool misuse; mitigation adds an output filter and revises the system prompt, and the signed report is attached to AI-SBOM-P. The adapter's AI-SBOM-M binds the checkpoint to the dataset and training code, with signatures and provenance at the production threshold.

At admission, CEL checks required labels and environment scope; Gatekeeper verifies approved model/prompt hashes, provenance tier, and guardrail presence for the protected "refunds" namespace. GitOps advances a 5% canary for the "retail-US" cohort. Telemetry remains within SLI envelopes, and safety monitors observe no jailbreak triggers. Promotion proceeds to 25% and then 100%. An auditor later replays the decision from the attestation bundle. Three weeks later, a separate incident implicates a different prompt; rollback reverts the serving graph atomically to the last known-good prompt/model pair in minutes, and remediation adds a new jailbreak scenario to the evaluation suite.

XI. Discussion

Why govern prompts and datasets like code? They change model behavior as much as code. Treating them as signable, attestable artifacts with mandatory evaluations tightens the control loop and reduces ambiguous responsibility.

Safety vs. velocity. Governance is risk-tiered. Low-risk prompt wording changes with current evaluations may flow quickly; high-risk dataset or adapter updates require deeper evidence and longer canaries. Evidence and policies are first-class but designed for operational pragmatism.

Interoperability. The framework works with hosted and self-hosted models. When base-model transparency is limited, the policy shifts emphasis to adapter provenance, serving guardrails, and strong runtime monitoring.

Organizational adoption. Start with provenance-only gates for models/adapters and prompt hash approval; add dataset lineage and prompt evaluations as teams mature. Tie post-incident learning to evaluation suite updates and policy refinements.

XII. Limitations and Future Work

Some governance needs human judgment (e.g., fairness trade-offs, legal interpretations). Evaluation remains imperfect; adversarial creativity evolves. Future work includes standardizing AI-SBOM schemas, policy-driven active testing (selective data acquisition to break current prompts), causal impact analysis for prompt/model rollouts, and tighter jurisdictional compliance automation.

XIII. Conclusion

LLMOps governance requires elevating data, prompts, models, and serving graphs to first-class governed artifacts with signed evidence, enforceable policies, and reproducible decisions. By defining an artifact taxonomy, a policy pack, enforcement surfaces, and concrete metrics, this paper presents a practicable blueprint for trustworthy, auditable LLM releases. The outcome is faster, safer iteration with clear rollbacks and accountable decision-making—no evidence, no exposure.


References

[1] National Institute of Standards and Technology (NIST), Artificial Intelligence Risk Management Framework (AI RMF 1.0), 2023.
[2] ISO/IEC, ISO/IEC 42001:2023 — Artificial Intelligence Management System — Requirements, 2023.
[3] OpenSSF, Supply-chain Levels for Software Artifacts (SLSA) — Specification v1.0, 2023.
[4] N. R. Murphy, D. Rensin, B. Beyer, and C. Jones (eds.), The Site Reliability Workbook: Practical Ways to Implement SRE. O’Reilly, 2018.
[5] T. Gebru, J. Morgenstern, B. Vecchione, et al., “Datasheets for Datasets,” Communications of the ACM, vol. 64, no. 12, pp. 86–92, 2021.
[6] M. Mitchell, S. Wu, A. Zaldivar, et al., “Model Cards for Model Reporting,” in Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAT*), 2019.
[7] S. Torres-Arias, H. Afzali, S. Awwad, R. Curtmola, and J. Cappos, “in-toto: Providing Farm-to-Table Guarantees for Bits and Bytes,” in Proceedings of the 28th USENIX Security Symposium, 2019.
[8] Sigstore Project, Sigstore: Design and Architecture for Software Artifact Signing and Verification, 2022.
[9] Open Policy Agent Project, “Open Policy Agent (OPA) and the Rego Policy Language,” CNCF documentation/white paper, 2021–2025.
[10] The Kubernetes Authors, “Policy Controllers and Admission,” documentation and release notes, 2022–2025.