Blog

AI Engineering Jun 6, 2026 5 min read

Prompt Injection Defense: A Layered Approach for Production LLM Apps

Indirect prompt injection is OWASP LLM01's hardest problem. Here is the production defense-in-depth stack — prevention, detection, impact mitigation, and the audit trail your auditor wants to see.

Abstract visualization of layered prompt-injection defense: clean data streams passing through a vertical electric-cyan detection plane while a single compromised red stream is filtered out.

Prompt Injection Defense: A Layered Approach for Production LLM Apps

Prompt injection has held the top spot in the OWASP Top 10 for LLM Applications for three consecutive years.^[1] Microsoft's MSRC called indirect prompt injection — where the malicious instruction rides in on content the LLM retrieves, not on what the user types — "one of the most widely-used techniques" in AI vulnerability reports received in 2025.^[2] The threat class has matured and the first real weaponization for data exfiltration in a major production system has been documented.

On Tuesday we published the introductory five-layer framework for prompt injection defense.^[bal-85] That post covered the what. This one moves to the operational layer — what runs in production, what gets logged, what an auditor sees. The thesis: production defense is prevention + detection + impact mitigation + audit trail, wired in a finite retry cascade, with a structured span per request that holds up against SOC 2, the EU AI Act, and NIST AI 600-2.

The Two Threat Classes — Direct vs. Indirect

Prompt injection has crystallized into two distinct categories with very different defenses. Direct injection is what most people picture: the user types instructions that alter the LLM's behaviour. It is bounded and can be hardened with instruction prioritisation and output-format validation. OWASP's LLM01:2025 entry treats direct injection as the easier half.^[1]

Indirect injection is the open problem. The LLM ingests untrusted content — a web page, an email, a RAG-retrieved document, a tool result, an image — and that content contains instructions. The user is unaware. The instructions can be hidden via white text, non-printing Unicode, or encoded payloads.

The 2024-2025 incidents were all indirect. In August 2024, PromptArmor disclosed a Slack AI vulnerability in which a single public-channel message could cause Slack AI to exfiltrate API keys from a private channel the attacker could not see.^[3] Six months later, Aim Security disclosed EchoLeak (CVE-2025-32711, CVSS 9.3) — a zero-click indirect prompt injection in Microsoft 365 Copilot. A single crafted email, no user interaction, caused Copilot to access internal files and exfiltrate their contents.^[4] Patched server-side, no in-the-wild exploitation, but the attack class is structural to any LLM with multi-source data access. The pattern: every major incident in the last 18 months was indirect injection combined with exfiltration via a tool the LLM had access to.

The Indirect Injection Attack Surface

The channels that produce indirect injection are not exotic. They are the channels a useful LLM-integrated app already has:

Channel	Real incident
RAG document	Slack AI, August 2024^[3]
Web page fetch	Bing Chat era demos
Email content	EchoLeak (CVE-2025-32711), May 2025^[4]
Tool result (search, query, code exec)	Microsoft data governance guidance^[5]
Image OCR / multimodal	OWASP LLM01:2025 Scenario #7^[1]

EchoLeak is the cleanest walkthrough.^[4] Four distinct bypasses in sequence: the XPIA classifier was evaded, link redaction missed the Markdown reference-style link syntax ([text][ref] with the URL stored separately), the Copilot UI's automatic image pre-fetch was used as the outbound trigger, and the Teams async file preview API was used as a redirect proxy to hide the attacker host behind a Microsoft domain. A single email, no click, internal files out. The root cause is structural: content channels look like data to the LLM, not instructions. Every defense pattern in the next section exists to recover that distinction.

Defense-in-Depth — Three Pillars

A production stack is not "which scanner is best." It is a composition of three pillars.

Pillar 1: Prevention

What we layer on top of a hardened system prompt:

Spotlighting (Microsoft Research). Three modes — delimiting, datamarking, encoding — wrap untrusted content in tokens the LLM is trained to treat as data, not instructions.^[2]
StruQ (CMU, USENIX Security 2025). Defends at the model level. Structured queries separate prompts and data into two channels via a secure front-end and a specially fine-tuned LLM, training the model to ignore instructions in the data portion.^[6]
CaMeL (Google / ETH Zürich, ICLR 2025). Defeats prompt injections by design. CaMeL extracts the control and data flows from the trusted query; untrusted data retrieved by the LLM cannot impact program flow. The LLM operates in a "code interpreter" mode where its outputs are values, not instructions.^[7]

Pillar 2: Detection

The 2026 vendor landscape, ordered by Lakera's PINT (Prompt Injection Test) benchmark:^[8]

Detector	Type	PINT score
Lakera Guard	API	97.71%
Azure AI Prompt Shield for Documents	API	91.19%
protectai/deberta-v3-base-prompt-injection	Local OSS	88.66%
WhyLabs LangKit	OSS library	80.02%
Azure AI Prompt Shield for User Prompts	API	77.50%

Even the best classifier is a probabilistic defense — every PINT score has a false-negative tail, and a determined attacker will find the inputs that land there. The pillar is necessary, not sufficient.

Pillar 3: Impact Mitigation

This pillar distinguishes a production stack from a scanner evaluation. The architectural insight: the strongest defense for indirect injection is not better detection, but reducing the blast radius of the LLM's tool access.^[2] Least-privilege tool tokens (scoped to exactly the data the LLM needs, read-only where possible, short-lived), sensitivity labels that gate which documents the LLM can read, no-markdown-exfil links (the EchoLeak vector was a Markdown reference-style link whose URL was an attacker host), and explicit user consent for any tool call that leaves the tenant boundary. When all three pillars are wired, a successful injection produces no exfiltration. The LLM can follow the injected instruction; the environment simply does not have the data or tools to make the instruction harmful.

The Repair-and-Retry Cascade

Detection catching a partial injection is the start of the next stage. A robust harness wires detection into a structured retry-and-repair cascade with a finite budget: canonicalize the input (unicode NFC, strip zero-width characters, detect homoglyphs); cheap pattern checks (~0.1ms regex and heuristic rules); classifier pass (Lakera or Azure Prompt Shields, ~50ms, either blocks outright or tags the input as untrusted); LLM call with the system prompt and the spotlit / encoded untrusted content; output sanitization (PII detection, jailbreak-output scan); schema validation (output must match a registered JSON schema; if not, repair retry, max 2 attempts);^[bal-14] judge model (a second LLM call asks "did this output comply with the system prompt?" If not, re-prompt with a hardened system prompt, max 1 attempt); structured log (OpenTelemetry span with input hash, scanner verdicts, retry count, judge verdict, PII redaction events, cost charged); budget consume (per-run cost attribution).

Retry budget is finite, and each retry uses a hardened variant of the system prompt. Naive repetition re-fails the same way. Circuit breakers sit above the cascade^[bal-88] — when retry depth exceeds a threshold or a scan layer itself starts failing, the breaker opens and the harness fails fast rather than burning budget on a known-bad request. CaMeL's capability model eliminates retry in many cases because untrusted content cannot influence program flow.^[7]

Audit-Trail Wiring for SOC 2 / EU AI Act / NIST AI 600-2

For an LLM serving regulated customers, the audit-trail fields you log per request matter more than the model itself for compliance. Eleven fields every LLM request log must have: timestamp and actor; canonical input hash; scanner verdict chain; system prompt version; model ID, version, and region; final prompt and response (PII-redacted, with hash of the unredacted form); retry count and judge verdict; PII redaction events; cost charged; tool calls and capabilities invoked; block/allow decision and reason.

These map to the three regulatory frameworks enterprise buyers are now asking about. SOC 2 uses CC7.2 (logging), CC8.1 (change management), and CC6.1 (access control). EU AI Act Articles 9 and 12, binding for high-risk systems starting August 2026, require a risk management system and immutable logs.^[eu-ai-act] NIST AI 600-2 GenAI Profile is the US federal guidance, with explicit mappings to MG.1.1 (logging integrity), MG.2.1 (detection monitoring), and MG.3.1 (tool-call traceability).^{[nist-ai-600-2]}

The pattern: every request becomes an OpenTelemetry span named llm.request with the eleven fields as attributes, shipped to your SIEM with immutable retention. The auditor gets a queryable trail; the forensic team gets a timeline; the regulator gets a conformity log. The same span, three audiences, one source of truth.^[ms-learn]

The Production Stack We Run at Balacode

Production stack: edge WAF feeds harness middleware, which forwards to LLM provider — *Figure 1: Production stack — edge WAF, harness middleware, LLM provider.*

What we have shipped and can claim in this post: the prompt-injection scanning layer (input canonicalization, regex, heuristic, and model-based classifier composition with structured verdicts),^[bal-19] the guardrail and safety loop infrastructure (the broader harness wiring scanning, output sanitization, retry/repair, and audit logging),^[bal-9] the schema registry (output schema validation enabling defensive retry when an injection causes malformed output),^[bal-14] and the WebSocket connection manager (the real-time harness execution layer for agentic LLM apps, where indirect injection is the highest-impact risk).^[bal-8] The OpenTelemetry observability schema and the per-tenant fail-open / fail-closed policy layer are part of our architecture design — we recommend them, but they are not separate shipped products.

Latency budget, end-to-end: edge 5ms + middleware 200-300ms + LLM 800ms ≈ 1 second. The detection pillar is the largest variable: a local OSS detector (ProtectAI DeBERTa, Rebuff) holds the middleware budget under 200ms; a cloud classifier (Lakera, Azure Prompt Shield) adds 50-150ms.

What to Do This Week

Five steps, sized to fit in a week. Wire an edge filter (Cloudflare Firewall for AI or equivalent, sub-5ms). Add canonicalization and a local classifier (ProtectAI DeBERTa or Rebuff in-process, behind the WAF). Log the eleven fields as an OpenTelemetry span per request, shipped to your SIEM. Scope your tool tokens — every LLM-callable tool gets a short-lived, least-privilege token, today not next quarter. Read the OWASP LLM01:2025 page and the Microsoft Learn guide to indirect prompt injection^[1]^[ms-learn] — both are short, both are correct.

For the adjacent layer — how circuit breakers compose with the cascade — see Circuit Breakers for LLMs. For the broader context on why production-grade AI infrastructure is now a Toronto / US East Coast priority, see Toronto's AI Scene: Why Infrastructure Is the Next Battleground.

References

^↩ OWASP GenAI Security Project, LLM01:2025 Prompt Injection (accessed 2026-06-06).
^↩ Paverd, Andrew. "How Microsoft defends against indirect prompt injection attacks." Microsoft Security Response Center, July 29, 2025. microsoft.com/en-us/msrc/blog/2025/07/…
^↩ PromptArmor, "Data Exfiltration from Slack AI via Indirect Prompt Injection," August 20, 2024. promptarmor.substack.com.
^↩ Aim Security, "EchoLeak: Zero-Click Indirect Prompt Injection in Microsoft 365 Copilot" (CVE-2025-32711, CVSS 9.3, May 2025). sentra.io/blog/copilot-echoleak-prompt-injection.
^↩ Microsoft, "Data, privacy, and security for Azure OpenAI Service" / AI data governance guidance. learn.microsoft.com.
^↩ Chen, Sizhe, et al., "StruQ: Defending Against Prompt Injection with Structured Queries," USENIX Security 2025. arxiv.org/abs/2402.06363.
^↩ Debenedetti, Edoardo, et al., "Defeating Prompt Injections by Design" (CaMeL), Google DeepMind / ETH Zürich, ICML 2025. arxiv.org/abs/2503.18813.
^↩ Lakera, "Lakera PINT (Prompt Injection Test) Benchmark." lakera.ai/product-updates/lakera-pint-benchmark.
^↩ Cloudflare, "AI security for apps: prompt injection detection." developers.cloudflare.com.
^↩ European Union, "Regulation (EU) 2024/1689 — Artificial Intelligence Act," Articles 9, 12, 15, 19. artificialintelligenceact.eu.
^↩ Microsoft Learn, "Defend against indirect prompt injection threats." learn.microsoft.com.
^↩ NIST, "AI 600-2 — Generative Artificial Intelligence Profile" (NIST AI Risk Management Framework). nvlpubs.nist.gov.
^↩ Balacode platform capability — WebSocket Connection Manager: the real-time harness execution layer for agentic LLM apps, where indirect injection is the highest-impact risk.
^↩ Balacode platform capability — Guardrail & Safety Loop Infrastructure: the broader harness wiring that scans input, sanitizes output, runs retry/repair on failure, and emits the per-request audit log.
^↩ Balacode platform capability — Schema Registry: registers expected output schemas; an LLM output that does not match triggers a defensive repair-retry before it ever reaches the user.
^↩ Balacode platform capability — Prompt-Injection Scanning Layer: in-process scanner with input canonicalization (NFKC, base64/ROT13 unwrap), regex and heuristic checks, and an optional model-judge for novel attacks.
^↩ Balacode, "5 Layers of Prompt Injection Defense for Production LLM Apps" (Tuesday's introductory post). balacode.io/blog/5-layers-prompt-injection-defense-production-llm.
^↩ Balacode, "Circuit Breakers for LLMs: Architecture & Fallback Logic." balacode.io/blog/circuit-breakers-llms-architecture-fallback.

If you're hardening LLM apps in production, book a 20-minute architecture review → — we will walk through your current detection and impact-mitigation stack, point out the gaps, and tell you which controls to wire first.

See also: our AI Harness Engineering service page for the production harness spec.

Prompt Injection Defense: A Layered Approach for Production LLM Apps

Prompt Injection Defense: A Layered Approach for Production LLM Apps

The Two Threat Classes — Direct vs. Indirect

The Indirect Injection Attack Surface

Defense-in-Depth — Three Pillars

Pillar 1: Prevention

Pillar 2: Detection

Pillar 3: Impact Mitigation

The Repair-and-Retry Cascade

Audit-Trail Wiring for SOC 2 / EU AI Act / NIST AI 600-2

The Production Stack We Run at Balacode

What to Do This Week

References

Related articles

AI Harness Engineering: Why Your POC Breaks in Production

The Schema Registry Pattern for LLM Output Contracts

The Toronto-Waterloo AI Corridor: Why a Toronto AI Engineering Consultancy Finds Enterprise Infrastructure Concentrating Here