Prompt Injection Defense: A Layered Approach for Production LLM Apps
Prompt injection has held the top spot in the OWASP Top 10 for LLM Applications for three consecutive years.[1] Microsoft's MSRC called indirect prompt injection — where the malicious instruction rides in on content the LLM retrieves, not on what the user types — "one of the most widely-used techniques" in AI vulnerability reports received in 2025.[2] The threat class has matured and the first real weaponization for data exfiltration in a major production system has been documented.
On Tuesday we published the introductory five-layer framework for prompt injection defense.[bal-85] That post covered the what. This one moves to the operational layer — what runs in production, what gets logged, what an auditor sees. The thesis: production defense is prevention + detection + impact mitigation + audit trail, wired in a finite retry cascade, with a structured span per request that holds up against SOC 2, the EU AI Act, and NIST AI 600-2.
The Two Threat Classes — Direct vs. Indirect
Prompt injection has crystallized into two distinct categories with very different defenses. Direct injection is what most people picture: the user types instructions that alter the LLM's behaviour. It is bounded and can be hardened with instruction prioritisation and output-format validation. OWASP's LLM01:2025 entry treats direct injection as the easier half.[1]
Indirect injection is the open problem. The LLM ingests untrusted content — a web page, an email, a RAG-retrieved document, a tool result, an image — and that content contains instructions. The user is unaware. The instructions can be hidden via white text, non-printing Unicode, or encoded payloads.
The 2024-2025 incidents were all indirect. In August 2024, PromptArmor disclosed a Slack AI vulnerability in which a single public-channel message could cause Slack AI to exfiltrate API keys from a private channel the attacker could not see.[3] Six months later, Aim Security disclosed EchoLeak (CVE-2025-32711, CVSS 9.3) — a zero-click indirect prompt injection in Microsoft 365 Copilot. A single crafted email, no user interaction, caused Copilot to access internal files and exfiltrate their contents.[4] Patched server-side, no in-the-wild exploitation, but the attack class is structural to any LLM with multi-source data access. The pattern: every major incident in the last 18 months was indirect injection combined with exfiltration via a tool the LLM had access to.
The Indirect Injection Attack Surface
The channels that produce indirect injection are not exotic. They are the channels a useful LLM-integrated app already has:
| Channel | Real incident |
|---|---|
| RAG document | Slack AI, August 2024[3] |
| Web page fetch | Bing Chat era demos |
| Email content | EchoLeak (CVE-2025-32711), May 2025[4] |
| Tool result (search, query, code exec) | Microsoft data governance guidance[5] |
| Image OCR / multimodal | OWASP LLM01:2025 Scenario #7[1] |
EchoLeak is the cleanest walkthrough.[4] Four distinct bypasses in sequence: the XPIA classifier was evaded, link redaction missed the Markdown reference-style link syntax ([text][ref] with the URL stored separately), the Copilot UI's automatic image pre-fetch was used as the outbound trigger, and the Teams async file preview API was used as a redirect proxy to hide the attacker host behind a Microsoft domain. A single email, no click, internal files out. The root cause is structural: content channels look like data to the LLM, not instructions. Every defense pattern in the next section exists to recover that distinction.
Defense-in-Depth — Three Pillars
A production stack is not "which scanner is best." It is a composition of three pillars.
Pillar 1: Prevention
What we layer on top of a hardened system prompt:
- Spotlighting (Microsoft Research). Three modes — delimiting, datamarking, encoding — wrap untrusted content in tokens the LLM is trained to treat as data, not instructions.[2]
- StruQ (CMU, USENIX Security 2025). Defends at the model level. Structured queries separate prompts and data into two channels via a secure front-end and a specially fine-tuned LLM, training the model to ignore instructions in the data portion.[6]
- CaMeL (Google / ETH Zürich, ICLR 2025). Defeats prompt injections by design. CaMeL extracts the control and data flows from the trusted query; untrusted data retrieved by the LLM cannot impact program flow. The LLM operates in a "code interpreter" mode where its outputs are values, not instructions.[7]
Pillar 2: Detection
The 2026 vendor landscape, ordered by Lakera's PINT (Prompt Injection Test) benchmark:[8]
| Detector | Type | PINT score |
|---|---|---|
| Lakera Guard | API | 97.71% |
| Azure AI Prompt Shield for Documents | API | 91.19% |
| protectai/deberta-v3-base-prompt-injection | Local OSS | 88.66% |
| WhyLabs LangKit | OSS library | 80.02% |
| Azure AI Prompt Shield for User Prompts | API | 77.50% |
Even the best classifier is a probabilistic defense — every PINT score has a false-negative tail, and a determined attacker will find the inputs that land there. The pillar is necessary, not sufficient.
Pillar 3: Impact Mitigation
This pillar distinguishes a production stack from a scanner evaluation. The architectural insight: the strongest defense for indirect injection is not better detection, but reducing the blast radius of the LLM's tool access.[2] Least-privilege tool tokens (scoped to exactly the data the LLM needs, read-only where possible, short-lived), sensitivity labels that gate which documents the LLM can read, no-markdown-exfil links (the EchoLeak vector was a Markdown reference-style link whose URL was an attacker host), and explicit user consent for any tool call that leaves the tenant boundary. When all three pillars are wired, a successful injection produces no exfiltration. The LLM can follow the injected instruction; the environment simply does not have the data or tools to make the instruction harmful.
The Repair-and-Retry Cascade
Detection catching a partial injection is the start of the next stage. A robust harness wires detection into a structured retry-and-repair cascade with a finite budget: canonicalize the input (unicode NFC, strip zero-width characters, detect homoglyphs); cheap pattern checks (~0.1ms regex and heuristic rules); classifier pass (Lakera or Azure Prompt Shields, ~50ms, either blocks outright or tags the input as untrusted); LLM call with the system prompt and the spotlit / encoded untrusted content; output sanitization (PII detection, jailbreak-output scan); schema validation (output must match a registered JSON schema; if not, repair retry, max 2 attempts);[bal-14] judge model (a second LLM call asks "did this output comply with the system prompt?" If not, re-prompt with a hardened system prompt, max 1 attempt); structured log (OpenTelemetry span with input hash, scanner verdicts, retry count, judge verdict, PII redaction events, cost charged); budget consume (per-run cost attribution).
Retry budget is finite, and each retry uses a hardened variant of the system prompt. Naive repetition re-fails the same way. Circuit breakers sit above the cascade[bal-88] — when retry depth exceeds a threshold or a scan layer itself starts failing, the breaker opens and the harness fails fast rather than burning budget on a known-bad request. CaMeL's capability model eliminates retry in many cases because untrusted content cannot influence program flow.[7]
Audit-Trail Wiring for SOC 2 / EU AI Act / NIST AI 600-2
For an LLM serving regulated customers, the audit-trail fields you log per request matter more than the model itself for compliance. Eleven fields every LLM request log must have: timestamp and actor; canonical input hash; scanner verdict chain; system prompt version; model ID, version, and region; final prompt and response (PII-redacted, with hash of the unredacted form); retry count and judge verdict; PII redaction events; cost charged; tool calls and capabilities invoked; block/allow decision and reason.
These map to the three regulatory frameworks enterprise buyers are now asking about. SOC 2 uses CC7.2 (logging), CC8.1 (change management), and CC6.1 (access control). EU AI Act Articles 9 and 12, binding for high-risk systems starting August 2026, require a risk management system and immutable logs.[eu-ai-act] NIST AI 600-2 GenAI Profile is the US federal guidance, with explicit mappings to MG.1.1 (logging integrity), MG.2.1 (detection monitoring), and MG.3.1 (tool-call traceability).[nist-ai-600-2]
The pattern: every request becomes an OpenTelemetry span named llm.request with the eleven fields as attributes, shipped to your SIEM with immutable retention. The auditor gets a queryable trail; the forensic team gets a timeline; the regulator gets a conformity log. The same span, three audiences, one source of truth.[ms-learn]
The Production Stack We Run at Balacode
`` ┌────────────────────────────────────────────────────────┐ │ EDGE (CDN / WAF) │ │ Cloudflare Firewall for AI — <5ms[cf] │ └────────────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────────────┐ │ HARNESS MIDDLEWARE (in-process) │ │ canonicalize → regex → heuristic → classifier │ │ → LLM call → output sanitize → schema validate │ │ → judge → structured log → budget consume │ │ 200-300ms — the actual defense stack │ └────────────────────────────────────────────────────────┘ │ ▼ ┌────────────────────────────────────────────────────────┐ │ LLM PROVIDER (last line of defense, not the first) │ └────────────────────────────────────────────────────────┘ ``
What we have shipped and can claim in this post: the prompt-injection scanning layer (input canonicalization, regex, heuristic, and model-based classifier composition with structured verdicts),[bal-19] the guardrail and safety loop infrastructure (the broader harness wiring scanning, output sanitization, retry/repair, and audit logging),[bal-9] the schema registry (output schema validation enabling defensive retry when an injection causes malformed output),[bal-14] and the WebSocket connection manager (the real-time harness execution layer for agentic LLM apps, where indirect injection is the highest-impact risk).[bal-8] The OpenTelemetry observability schema and the per-tenant fail-open / fail-closed policy layer are part of our architecture design — we recommend them, but they are not separate shipped products.
Latency budget, end-to-end: edge 5ms + middleware 200-300ms + LLM 800ms ≈ 1 second. The detection pillar is the largest variable: a local OSS detector (ProtectAI DeBERTa, Rebuff) holds the middleware budget under 200ms; a cloud classifier (Lakera, Azure Prompt Shield) adds 50-150ms.
What to Do This Week
Five steps, sized to fit in a week. Wire an edge filter (Cloudflare Firewall for AI or equivalent, sub-5ms). Add canonicalization and a local classifier (ProtectAI DeBERTa or Rebuff in-process, behind the WAF). Log the eleven fields as an OpenTelemetry span per request, shipped to your SIEM. Scope your tool tokens — every LLM-callable tool gets a short-lived, least-privilege token, today not next quarter. Read the OWASP LLM01:2025 page and the Microsoft Learn guide to indirect prompt injection[1][ms-learn] — both are short, both are correct.
For the adjacent layer — how circuit breakers compose with the cascade — see Circuit Breakers for LLMs. For the broader context on why production-grade AI infrastructure is now a Toronto / US East Coast priority, see Toronto's AI Scene: Why Infrastructure Is the Next Battleground.
References
- ↩ OWASP GenAI Security Project, LLM01:2025 Prompt Injection (accessed 2026-06-06).
- ↩ Paverd, Andrew. "How Microsoft defends against indirect prompt injection attacks." Microsoft Security Response Center, July 29, 2025. microsoft.com/en-us/msrc/blog/2025/07/…
- ↩ PromptArmor, "Data Exfiltration from Slack AI via Indirect Prompt Injection," August 20, 2024. promptarmor.substack.com.
- ↩ Aim Security, "EchoLeak: Zero-Click Indirect Prompt Injection in Microsoft 365 Copilot" (CVE-2025-32711, CVSS 9.3, May 2025). sentra.io/blog/copilot-echoleak-prompt-injection.
- ↩ Microsoft, "Data, privacy, and security for Azure OpenAI Service" / AI data governance guidance. learn.microsoft.com.
- ↩ Chen, Sizhe, et al., "StruQ: Defending Against Prompt Injection with Structured Queries," USENIX Security 2025. arxiv.org/abs/2402.06363.
- ↩ Debenedetti, Edoardo, et al., "Defeating Prompt Injections by Design" (CaMeL), Google DeepMind / ETH Zürich, ICML 2025. arxiv.org/abs/2503.18813.
- ↩ Lakera, "Lakera PINT (Prompt Injection Test) Benchmark." lakera.ai/product-updates/lakera-pint-benchmark.
- ↩ Cloudflare, "AI security for apps: prompt injection detection." developers.cloudflare.com.
- ↩ European Union, "Regulation (EU) 2024/1689 — Artificial Intelligence Act," Articles 9, 12, 15, 19. artificialintelligenceact.eu.
- ↩ Microsoft Learn, "Defend against indirect prompt injection threats." learn.microsoft.com.
- ↩ NIST, "AI 600-2 — Generative Artificial Intelligence Profile" (NIST AI Risk Management Framework). nvlpubs.nist.gov.
- ↩ Balacode platform capability — WebSocket Connection Manager: the real-time harness execution layer for agentic LLM apps, where indirect injection is the highest-impact risk.
- ↩ Balacode platform capability — Guardrail & Safety Loop Infrastructure: the broader harness wiring that scans input, sanitizes output, runs retry/repair on failure, and emits the per-request audit log.
- ↩ Balacode platform capability — Schema Registry: registers expected output schemas; an LLM output that does not match triggers a defensive repair-retry before it ever reaches the user.
- ↩ Balacode platform capability — Prompt-Injection Scanning Layer: in-process scanner with input canonicalization (NFKC, base64/ROT13 unwrap), regex and heuristic checks, and an optional model-judge for novel attacks.
- ↩ Balacode, "5 Layers of Prompt Injection Defense for Production LLM Apps" (Tuesday's introductory post). balacode.io/blog/5-layers-prompt-injection-defense-production-llm.
- ↩ Balacode, "Circuit Breakers for LLMs: Architecture & Fallback Logic." balacode.io/blog/circuit-breakers-llms-architecture-fallback.
If you're hardening LLM apps in production, book a 20-minute architecture review → — we will walk through your current detection and impact-mitigation stack, point out the gaps, and tell you which controls to wire first.