5 Layers of Prompt Injection Protection for Production LLM Apps
Prompt injection has held the #1 spot on the OWASP Top 10 for LLM Applications for three consecutive years, and the 2025 update added a multimodal category on top of direct and indirect.[1] Every major production incident in 2024-2025 — Slack AI in August 2024, EchoLeak against Microsoft 365 Copilot in May 2025, and the GitHub Copilot Chat class disclosed in August 2025 — was an indirect injection riding on content the LLM retrieved, not on what the user typed.[2][3][4] Production-grade prompt injection protection is no longer a single scanner. It is a five-layer composition: llm input scanning, prompt injection detection, isolation, llm output validation, and observability. This post is the tactical checklist for each layer.
The 5-layer framework itself is not new. We introduced it as an introductory mental model in 2023,10 and on Tuesday we published the architectural deep-dive covering the detection cascade, the indirect-injection attack surface, and the audit-trail wiring for SOC 2 and the EU AI Act.11 Today's post is the checklist companion: one page, one 4-point checklist per layer, designed to be the reference an engineering lead prints and tapes to the wall. The deep-dive is for the architect; this list is for the engineer wiring it next sprint. The mental model: each layer has a job, and the only way the whole stack fails is if a single attack slips past every layer at once. Together they form the production ai security guardrails every regulated LLM deployment needs.1011
Layer 1 — Input Validation and Sanitization
Goal: make the untrusted input boring. Every byte the LLM sees should be the byte a human would see.
This is the cheapest layer and the one most teams under-invest in. Every defense upstream depends on the LLM seeing a canonical, typed, bounded input. The four checks below are the prompt injection protection baseline; every other layer assumes Layer 1's llm input scanning has already done its job.
Figure 1: The five-layer prompt-injection protection checklist matrix. Accent color marks Balacode-shipped capabilities versus architecture-design items.
The 4-point checklist:
- NFKC unicode normalization. Strip zero-width characters, collapse homoglyphs, reject ASCII-smuggling payloads.[5][9]
- Size, depth, and time caps. Reject inputs larger than 16 KB at the LLM boundary, cap conversation depth, cap wall-clock time. The point is bounded blast radius.
- Encoding unwrap. Decode base64, ROT13, and hex payloads; if the decoded span contains an instruction, flag and quarantine.
- Type and schema enforcement. Treat user input as untyped bytes at the edge. Require a typed envelope (function-call args, RAG document) before it reaches the LLM.
What this layer does not catch: novel paraphrases, encoded payloads inside a structured field, or indirect injection via a user-controlled API. Those are Layer 3 territory.
The phrase worth keeping in mind at this layer is llm input scanning: every defensive technique in Layer 1 is some form of input scanning, and the layer's value is in making the input boring before any classifier or model ever sees it. Get the canonicalization wrong and every downstream classifier inherits a polluted prompt; get it right and Layer 2's job becomes dramatically easier.
Layer 2 — Prompt Injection Detection
Goal: catch the 95% of known attacks cheaply and flag the 5% of novel attacks for handling downstream.
The 4-point checklist:
- Regex / signature pass against a hot-reloadable pattern list. Catch the obvious: "ignore previous instructions," "DAN," "developer mode," markdown-image exfil URLs, base64-lookalike spans. Sub-5 ms.
- Heuristic / entropy pass. Shannon-entropy threshold (above 4.5 bits/char on a 100-token span is suspicious); instruction-verb density check ("ignore," "forget," "system," "override"). Sub-10 ms.
- Model-based judge for novel attacks. A small classifier scores the input as "likely injection / likely safe." 30–200 ms, on the order of $0.0001 per call. Microsoft Prompt Shields and Cohere Prompt Guard 2 are the canonical production examples.[5]
- Ensemble and voting. Run two or more judges and apply a voting rule. Even one judge plus one deterministic check is enough to drop false-positive rate substantially.
Lakera's PINT benchmark numbers, in production order: Lakera Guard 97.71%, Azure AI Prompt Shield for Documents 91.19%, ProtectAI DeBERTa (local OSS) 88.66%.[6] Even the best classifier has a false-negative tail — which is why production-grade prompt injection detection is a composition, not a choice.
What this layer does not catch: a sufficiently novel attack that all judges miss. That is what Layer 5 exists to catch via anomaly detection. The OWASP LLM Prompt Injection Prevention Cheat Sheet is the canonical pattern reference here; read it once and keep it open while you build.[9]
Layer 3 — Prompt Isolation and Privilege Separation
Goal: ensure the LLM's instructions and the LLM's data are not the same channel.
The 4-point checklist:
- Three distinct channels for system prompt, user prompt, and tool result. Never concatenate them. Use reserved special tokens or wrappers — Spotlighting datamarking or StruQ's secure front-end — so the model can distinguish.[5][7]
- Indirect-injection defense at the data boundary. Every piece of data the LLM consumes (RAG doc, web fetch, email, tool result, image OCR) is marked untrusted at the moment of entry. Untrusted spans are quoted, not executed.
- Capability-bounded tool execution (CaMeL pattern). A planning LLM produces code; a separate value-extraction LLM handles untrusted data with no tool access; the interpreter applies security policies based on data lineage.[8] (Architecture design at Balacode — not shipped.)
- Privilege separation per tenant and per request. Different tenants get different tool scopes. A read-only LLM in a sandboxed tenant cannot reach the production database.
What this layer does not catch: a clever indirect-injection payload that smuggles instructions through a legitimate data channel (a Slack message from a compromised account, for example). For that, Layer 2's classifier and Layer 5's anomaly detection are the backstop.
Layer 3 is where the deep-dive we published Tuesday earned its keep; it is also where most teams confuse "we do string concatenation" with "we have isolation." A clean channel discipline plus a CaMeL-style capability boundary is the gap that closes EchoLeak-class exploits.[3][8]
Layer 4 — Output Validation and Sanitization
Goal: make the LLM's output boring too. A confused model emits malformed instructions; a leaking model emits PII. This is the llm output validation layer.
The 4-point checklist:
- Schema-validate every output. If the LLM emits a function call, the arguments must match the registered JSON schema. If they do not, refuse and re-prompt — this is also the "injection via malformed tool call" backstop.
- PII detection on output. Redact emails, phone numbers, SSNs, credit cards, and API keys before the response leaves the LLM boundary. Maintain a redaction log per request for the audit trail.
- Jailbreak-output detection. A second model-judge asks "did this output contain instructions that did not appear in the system prompt or user input?" — a data lineage check.
- Action authorization. Before any tool call, the harness asks: is the recipient (file path, URL, email address) one the user specified, or one the LLM derived from a tool result? If the latter, escalate to a human (HITL).[4][5]
Effective LLM output validation is the difference between "the LLM tried to do something dangerous" and "the LLM succeeded in doing something dangerous." The first is a recoverable log line; the second is a breach notification. Without it, the LLM's response stream is the largest untrusted input your system ever ingests.
What this layer does not catch: a technically compliant output that is semantically harmful (a perfectly valid JSON that contains a recommendation the user should not have). For that, the LLM provider's safety training is the backstop and Layer 3's privilege separation limits the blast radius.
Layer 5 — Observability and Response
Goal: when an attack lands, know it, contain it, and learn from it.
The 4-point checklist:
- Per-request OpenTelemetry span. Every layer (input validate, regex, heuristic, judge, LLM call, output sanitize, schema validate, log) emits a span with
layer.name,layer.verdict,layer.confidence,layer.latency_ms, andlayer.cost_usd. (Shipped at Balacode.) - Anomaly detection on span aggregates. Sudden spike in "judge blocked" verdicts, sudden drop in user-input length, sudden rise in tool calls the user did not invoke — alert.
- Circuit breaker on the cascade. When retry depth exceeds a threshold or a scan layer itself starts failing, the breaker opens and the harness fails fast rather than burning budget on a known-bad request.12
- Incident response runbook. Pre-written: "if anomaly X fires, page on-call, snapshot the last 1,000 spans, switch to fail-closed mode for the affected tenant, post the IOCs to the tenant's Slack channel."[5]
What this layer does not catch: an attack that succeeds without triggering a measurable anomaly — the rare 0.1% case. For that, the audit trail (chain-hashed, append-only log) is the backstop. If you cannot prevent it, at least prove it happened. (Chain-hashed audit log is architecture design at Balacode, not shipped.)
This is the layer that turns the ai security guardrails above from a one-time shipping checklist into a living system. The span data is what the SOC 2 auditor and the EU AI Act conformity assessor will read. Without observability, every other layer is a static fence; with observability, the fence is actively patrolled. The same ai security guardrails pattern is what differentiates a production harness from a research notebook — the notebook stops at "did the model return JSON"; the harness keeps going through the alert, the circuit breaker, and the on-call page.
The 5-Layer Cheat Sheet
A compressed reference — tape it to the wall.
| Layer | What it owns | One-line checklist | Balacode status |
|---|---|---|---|
| 1 — Input validation | Canonicalize, cap, unwrap, type-enforce | NFKC + cap + decode + schema | Shipped |
| 2 — Detection | Catch known + novel attacks | Regex → heuristic → judge → ensemble | Shipped (regex + heuristic); judge in architecture design |
| 3 — Isolation | Separate instruction from data | Distinct channels + data lineage + CaMeL pattern | Shipped (channels); CaMeL in architecture design |
| 4 — Output validation | Validate, redact, judge, authorize | Schema + PII + judge + HITL | Shipped |
| 5 — Observability | Detect, contain, learn | OTel span + anomaly + breaker + runbook | Shipped (OTel span); chain-hashed audit log in architecture design |
Figure 2: Incident-to-layer mapping. Slack AI, EchoLeak, and GitHub Copilot Chat each rode a different combination of layers' misses; the matrix shows which layer would have caught each one in a properly composed stack.
For the architectural deep-dive on the detection cascade and audit-trail wiring for SOC 2, the EU AI Act, and NIST AI 600-2, see Prompt Injection Defense: A Layered Approach for Production LLM Apps. For the introductory 5-layer framing, see 5 Layers of Prompt Injection Defense. For circuit breakers in the cascade, see Circuit Breakers for LLMs.
If you're hardening LLM apps in production, book a 20-minute architecture review → — we will walk through your current detection and impact-mitigation stack, point out the gaps, and tell you which controls to wire first.
Sources
-
OWASP, LLM01:2025 Prompt Injection, OWASP Top 10 for LLM Applications. genai.owasp.org ↩
-
Microsoft Security Response Center, How Microsoft Defends Against Indirect Prompt Injection Attacks, 2025-07-29. microsoft.com ↩
-
Aim Security, EchoLeak (CVE-2025-32711), 2025-06-11. NVD: nvd.nist.gov ↩↩
-
GitHub Security, Safeguarding VS Code Against Prompt Injections, 2025-08-25. github.blog ↩↩
-
Microsoft Security Response Center, How Microsoft Defends Against Indirect Prompt Injection Attacks, 2025-07-29. microsoft.com (in-depth pattern reference; cited again at Layers 1, 3, 4, 5 for the specific technique each layer borrows) ↩↩↩↩↩
-
Keegan Hines et al., Defending Against Indirect Prompt Injection Attacks With Spotlighting, Microsoft Research / arXiv, 2024. arxiv.org; Berkeley AI Research, StruQ: Defending Against Prompt Injection with Structured Queries, USENIX Security 2025. bair.berkeley.edu ↩
-
Edoardo Debenedetti et al., CaMeL: Defeating Prompt Injections via a Robustly-Learned Capability-Based P-LLM/Q-LLM Architecture, Google DeepMind / ETH Zürich / ICLR 2025. arxiv.org ↩↩
-
OWASP, LLM Prompt Injection Prevention Cheat Sheet. cheatsheetseries.owasp.org ↩↩
-
Balacode, 5 Layers of Prompt Injection Defense (introductory framework), 2026-06-03. ↩↩
-
Balacode, Prompt Injection Defense: A Layered Approach for Production LLM Apps (architectural deep-dive), 2026-06-06. ↩↩
-
Balacode, Circuit Breakers for LLMs — Architecture, Config, and Fallback Logic, 2026-06-04. ↩