The Schema Registry Pattern for LLM Output Contracts

Schema validation isn't optional for LLM apps. Learn the registry pattern that makes output contracts enforceable in production — and why 95% never ship.

Figure: vertical registry spine with four JSON Schema contracts (OrderSummary, RefundProposal, EscalationPacket, Refusal) attached.

Why 95% of custom GenAI apps never reach production — and the one pattern that holds the line.

If you read last week's Circuit Breakers for LLMs and the Layered Prompt Injection Defense, you have seen the input-contract half of the harness. This post is the output-contract companion — the schema registry pattern that makes the LLM's reply enforceable in production, the way a database CHECK constraint makes a column enforceable. It is the most load-bearing component in our AI Harness Engineering practice, and the reason the MIT NANDA GenAI Divide report's 95% pilot-failure number [^1] is reversible.

The fix has three parts: provider-side schema enforcement (OpenAI Structured Outputs and its equivalents) as the front stop, runtime validation of the LLM output as the backstop, and a JSON Schema registry as the contract. Six components, in order.

1. Schema as the contract

A versioned JSON Schema (draft 2020-12) in schemas/v1/ is the unit of truth. Once a contract is in use, the version is frozen. Pydantic v2 (Python) and Zod 4 (TypeScript) are the dominant SDK-level validators; both export to JSON Schema for the provider. Pydantic v2 powers ~8,000 PyPI packages; Zod 4 is the de facto TypeScript standard [^2].
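As a rough illustration of "versioned and frozen", here is a minimal in-memory sketch. The SchemaRegistry class, its method names, and the RefundProposal fields shown are assumptions made for this example; the post itself only specifies files under schemas/v1/.

```python
from types import MappingProxyType


class SchemaRegistry:
    """In-memory registry keyed by (name, version); entries are write-once."""

    def __init__(self):
        self._schemas = {}

    def register(self, name, version, schema):
        key = (name, version)
        if key in self._schemas:
            # A published version is frozen: any change needs a new version.
            raise ValueError(f"{name} {version} is frozen; publish a new version")
        # Read-only view so callers cannot reassign top-level keys of a contract.
        self._schemas[key] = MappingProxyType(dict(schema))

    def get(self, name, version):
        return self._schemas[(name, version)]


registry = SchemaRegistry()
registry.register("RefundProposal", "v1", {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "refund_amount": {"type": "number"},
        "reason": {"type": "string"},
    },
    "required": ["refund_amount", "reason"],
    "additionalProperties": False,
})
```

Registering ("RefundProposal", "v1") a second time raises ValueError, which is the point: a contract change becomes a visible v2 rather than a silent edit.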

2. Provider-side enforcement

Turn on strict: true + text.format = { type: "json_schema", ... } for OpenAI Structured Outputs, strict: true on every Anthropic tool, responseSchema for Gemini, and toolConfig for Bedrock. In OpenAI's eval of complex schema adherence, gpt-4o-2024-08-06 with Structured Outputs scores a perfect 100%, while gpt-4-0613 scores below 40% [^3]. Anthropic's strict tool use, the second leg, covers function-call arguments and guarantees "Functions receive correctly-typed arguments every time. No need to validate and retry tool calls" [^4]. The third leg, validating function calls against the registry's JSON Schema, is the part most teams still skip.
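For the OpenAI case, the request shape can be sketched as plain data. This is a sketch under assumptions: build_openai_request, is_strict_compatible, and the RefundProposal fields are hypothetical names for illustration, and the compatibility check covers only two documented strict-mode rules at the top level of the schema (every property listed in required, and additionalProperties set to false).

```python
REFUND_PROPOSAL_SCHEMA = {
    "type": "object",
    "properties": {
        "refund_amount": {"type": "number"},
        "reason": {"type": "string"},
    },
    "required": ["refund_amount", "reason"],
    "additionalProperties": False,
}


def is_strict_compatible(schema):
    """Check two strict-mode rules on the top-level object only (not nested ones)."""
    props = schema.get("properties", {})
    return (
        schema.get("type") == "object"
        and schema.get("additionalProperties") is False
        and set(schema.get("required", [])) == set(props)
    )


def build_openai_request(model, schema_name, schema, user_input):
    """Build keyword arguments for a Responses API call with strict JSON Schema output."""
    if not is_strict_compatible(schema):
        raise ValueError(f"{schema_name} is not usable with strict: true")
    return {
        "model": model,
        "input": user_input,
        "text": {
            "format": {
                "type": "json_schema",
                "name": schema_name,
                "schema": schema,
                "strict": True,
            }
        },
    }


request = build_openai_request(
    "gpt-4o-2024-08-06", "RefundProposal", REFUND_PROPOSAL_SCHEMA,
    "Propose a refund for the damaged order.",
)
```

With the OpenAI Python SDK, a dictionary like this would typically be passed as client.responses.create(**request); no network call is made here.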

3. Runtime validation as the backstop

Constrained decoding still fails in two ways: output that is type-correct but semantically wrong (passengers: 2 on a fully booked flight), and edge cases in the provider's supported-schema subset. A second pass through Pydantic v2, Zod 4, instructor, or Outlines, with custom validators for the business rules, catches both. Runtime validation is the floor beneath the Structured Outputs ceiling.
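The fully booked flight case can be made concrete with a small semantic check. This sketch uses a hand-rolled stdlib validator as a stand-in for a Pydantic v2 or Zod 4 custom validator; validate_booking and the seats_available parameter are hypothetical names, and the seat count is assumed to come from the booking system rather than from the model.

```python
def validate_booking(payload, seats_available):
    """Return a list of error strings; an empty list means the payload passed."""
    errors = []
    passengers = payload.get("passengers")
    # Type check: bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(passengers, int) or isinstance(passengers, bool):
        errors.append("passengers: must be an integer")
    elif passengers < 1:
        errors.append("passengers: must be at least 1")
    elif passengers > seats_available:
        # Type-correct, schema-valid, and still wrong for the business.
        errors.append(
            f"passengers: {passengers} requested, only {seats_available} seats left"
        )
    return errors


errors_full_flight = validate_booking({"passengers": 2}, seats_available=0)
errors_open_flight = validate_booking({"passengers": 2}, seats_available=5)
```

The first call returns one error even though the payload would pass any provider-side schema; the second returns an empty list.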

4. The single repair-and-retry loop

A failed runtime validation does not fail the request. The harness returns the validation errors to the model once, in a re-prompt. instructor's max_retries=3 is the canonical config; the production consensus is max_retries=2 (one repair, then escalate). Beyond that, the prompt, schema, or model is wrong, and a human needs to look [^5].
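The loop itself is short. The sketch below is not instructor's implementation; it is a stdlib-only illustration in which call_with_repair, EscalationRequired, validate_refund, and the scripted fake model are all hypothetical. It assumes the model is a callable that takes a prompt string and returns raw JSON text.

```python
import json


class EscalationRequired(Exception):
    """Raised when the repair budget is spent and a human needs to look."""


def call_with_repair(call_model, validate, prompt, max_attempts=2):
    """Call the model, validate, and re-prompt with the errors (once by default)."""
    current_prompt = prompt
    errors = []
    for attempt in range(1, max_attempts + 1):
        raw = call_model(current_prompt)
        try:
            payload = json.loads(raw)
            errors = validate(payload)
        except json.JSONDecodeError as exc:
            payload, errors = None, [f"invalid JSON: {exc.msg}"]
        if not errors:
            return payload, attempt
        # Feed the validator's errors back to the model for the repair attempt.
        current_prompt = (
            f"{prompt}\n\nYour previous reply failed validation:\n- "
            + "\n- ".join(errors)
            + "\nReturn corrected JSON only."
        )
    raise EscalationRequired(errors)


def validate_refund(payload):
    if not isinstance(payload, dict):
        return ["payload: must be a JSON object"]
    amount = payload.get("refund_amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        return ["refund_amount: must be a positive number"]
    return []


# Scripted fake model: the first reply is invalid, the repaired reply is valid.
replies = iter(['{"refund_amount": "forty dollars"}', '{"refund_amount": 40}'])
payload, attempts = call_with_repair(
    lambda _prompt: next(replies), validate_refund, "Propose a refund."
)
```

Here the call succeeds on the second attempt. A model that keeps failing raises EscalationRequired after the second attempt, which is the hook for paging a human instead of looping.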

5. Refusal-class detection

Provider structured output can return a refusal field when the model declines for safety. This is a normal response, not an exception. Add a refusal: string | null to the schema, a discriminated union on success, or a top-level error variant, and route it to a distinct UI surface. OpenAI: "Since a refusal does not follow the schema you have supplied in response_format, the API has a new field refusal" [^3].
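One way to treat a refusal as a normal response is to fold it into a tagged union before any downstream code sees the result. The Success and Refusal classes, the classify and route functions, and the surface names are assumptions for this sketch; the only thing taken from the provider is the idea of a separate refusal string alongside the parsed output.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Success:
    data: dict


@dataclass(frozen=True)
class Refusal:
    reason: str


def classify(parsed: Optional[dict], refusal: Optional[str]) -> Union[Success, Refusal]:
    """Map a provider response onto the union before anything else looks at it."""
    if refusal is not None:
        return Refusal(reason=refusal)
    if parsed is None:
        # Neither a payload nor a refusal: this one really is an error.
        raise ValueError("response had neither parsed output nor a refusal")
    return Success(data=parsed)


def route(result: Union[Success, Refusal]) -> str:
    # Refusals go to their own UI surface; they are not exceptions.
    if isinstance(result, Refusal):
        return "refusal_surface"
    return "downstream"


ok = classify({"refund_amount": 40, "reason": "damaged"}, None)
declined = classify(None, "I can't help with that request.")
```

route(ok) yields "downstream" and route(declined) yields "refusal_surface", with no try/except on the happy path.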

6. Audit trail on the OTel GenAI span

Every request emits a span with gen_ai.request.schema.name, gen_ai.request.schema.version, gen_ai.response.validation.verdict, gen_ai.response.validation.retry_count, and gen_ai.response.refusal.reason. The auditor's question, "show me the conversation, and what the schema required," is then answered by a single trace query.
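The attribute set above can be assembled as a plain dictionary before it is attached to a span. This sketch deliberately avoids the OpenTelemetry dependency; validation_span_attributes and the verdict values "pass" and "refusal" are hypothetical. The refusal reason is left out when there is none, because OpenTelemetry attribute values cannot be None.

```python
def validation_span_attributes(schema_name, schema_version, verdict,
                               retry_count, refusal_reason=None):
    """Build the audit attributes for one request as a flat dict."""
    attrs = {
        "gen_ai.request.schema.name": schema_name,
        "gen_ai.request.schema.version": schema_version,
        "gen_ai.response.validation.verdict": verdict,
        "gen_ai.response.validation.retry_count": retry_count,
    }
    # Only set the refusal reason when the model actually refused.
    if refusal_reason is not None:
        attrs["gen_ai.response.refusal.reason"] = refusal_reason
    return attrs


passed = validation_span_attributes("RefundProposal", "v1", "pass", 1)
refused = validation_span_attributes(
    "RefundProposal", "v1", "refusal", 0, refusal_reason="safety"
)
```

With the OpenTelemetry Python API, each key and value would then be applied to the current span, for example through span.set_attribute(key, value).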

What the architecture looks like

Request flow: Provider Structured Output → Runtime Validator → Repair-and-Retry → Refusal → Downstream.

Figure 1: Request flow through the schema registry pattern — provider constrained decoding feeds a runtime validator, which feeds a single repair-and-retry loop, which routes validated responses downstream and refusals to a separate UI surface.

The integration point is the schema-driven circuit breaker we covered in the circuit breaker post. A schema-validation failure on Model A increments the breaker for Model A only, not globally. The schema is the unit of accounting for the per-model breaker state. Function-call validation is the third leg of the JSON Schema contract; together with input scanning, it forms the boundary of the schema registry pattern.
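The per-model accounting can be sketched as a counter keyed by model and schema. This is an assumption-laden illustration, not the breaker from the earlier post: PerModelBreaker, the threshold of three consecutive failures, and the model names are all hypothetical.

```python
class PerModelBreaker:
    """Counts consecutive schema-validation failures per (model, schema) pair."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._failures = {}

    def record(self, model, schema_name, valid):
        key = (model, schema_name)
        if valid:
            # A validated response resets the count for this pair only.
            self._failures[key] = 0
        else:
            self._failures[key] = self._failures.get(key, 0) + 1

    def is_open(self, model, schema_name):
        return self._failures.get((model, schema_name), 0) >= self.threshold


breaker = PerModelBreaker(threshold=3)
for _ in range(3):
    breaker.record("model-a", "RefundProposal", valid=False)
breaker.record("model-b", "RefundProposal", valid=False)
```

After these calls the breaker is open for model-a on RefundProposal and still closed for model-b, so traffic can fail over without a global outage.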

A composite scenario

Illustrative Scenario — Composite. The team, the order, the bot, and the metrics are invented. The pattern and the citations are real.

A four-engineer platform team runs an LLM support assistant that calls four tools: get_order, get_customer, propose_refund, escalate_to_human. In Q1, they use openai.ChatCompletion.create(...) with response_format: { type: "json_object" } — JSON mode, not Structured Outputs. The model occasionally returns {"refund_amount": "forty dollars"} or omits the reason field. The dispatcher throws. The on-call pages at 2 a.m.

In Q2, the team adopts the registry. Four JSON Schemas in schemas/v1/: OrderSummary, RefundProposal, EscalationPacket, Refusal. The OpenAI call becomes client.responses.parse(model=..., text_format=RefundProposal, ...). Pydantic v2 runs a second pass: refund_amount must be a positive number, reason one of seven enumerated strings. On validation failure, the harness re-prompts once with the Pydantic error; second failure routes to Refusal and pages the on-call. The 2 a.m. pager rate drops to zero.

The buyer's 7-item pre-deploy checklist

  1. Define the schema as the contract, not the implementation. Version it, freeze it, and make the file the source of truth.
  2. Turn on provider-side Structured Outputs and strict mode for every model call. Never default to JSON mode: it is a hint, not a guarantee.
  3. Run runtime validation as the backstop: Pydantic v2, Zod 4, instructor, or Outlines. Catch type-correct but semantically wrong outputs.
  4. Implement the single repair-and-retry loop. max_retries=2 is the production consensus.
  5. Treat refusal as a normal response, not an exception. Add the field, the union, or the variant to every schema.
  6. Make the schema the unit of accounting. Cost and rate controls charge per validated response, not per token.
  7. Emit the schema and verdict on the OTel GenAI span. Without these, the on-call has no way to attribute a failure to the right component.

This pattern is the floor, not the ceiling. Everything above it — agent state, planning, reflection — assumes this layer is in place. With it, the 95% starts to look a lot more like a 5%.


Sources

  1. MIT NANDA GenAI Divide report, summarized in Innovative Human Capital, 2025-12-09. https://www.innovativehumancapital.com/article/the-genai-divide-why-95-of-enterprise-ai-investments-fail-and-how-the-5-succeed
  2. Pydantic v2 docs. https://docs.pydantic.dev/latest/ · Zod 4 docs. https://zod.dev/
  3. OpenAI, Structured Outputs guide. https://platform.openai.com/docs/guides/structured-outputs
  4. Anthropic, Strict tool use docs. https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/strict-tool-use.md
  5. instructor (Python) docs and GitHub README. https://python.useinstructor.com/ · https://github.com/567-labs/instructor