
AI Engineering Jun 4, 2026 5 min read

Circuit Breakers for LLMs: Architecture & Fallback Logic

Build production-grade LLM circuit breakers with per-model thresholds, cost-aware fallback chains, and OpenTelemetry observability.

Figure: three-node state diagram (solid for closed, pulsing for half-open, sparking particles for open) illustrating the LLM circuit breaker state machine.

Circuit Breakers for LLMs: Architecture, Config, and Fallback Logic

Your AI agent works perfectly in staging. Then production traffic hits, OpenAI returns a 429, Claude goes dark for two hours, and your users see empty responses. You did not have a circuit breaker. You had hope.

This post explains what a circuit breaker is, why the generic pattern fails for LLMs, and how to build one that understands model-specific failure modes, cost-aware fallback chains, and dynamic latency budgets. Every pattern here is something we ship in production harnesses at Balacode.

The Production AI Resilience Gap

Only 12% of AI agent projects reach production. The leading cause of death is not model quality. It is the absence of a production-grade harness: circuit breakers, retry discipline, and observability[^1]. Models are getting smarter, but provider reliability is not keeping pace. GPT-5 does not fix rate limits. Claude 4 does not prevent provider outages.

The gap is structural. Most teams build agents as direct API calls to a single provider. When that provider fails, the agent fails. There is no fast-fail path, no fallback model, no telemetry to tell you what broke. The harness — the layer between your application logic and the raw LLM API — is the missing piece.

At Balacode, we treat the harness as first-class infrastructure. Circuit breakers are not an afterthought. They are the guardrail that keeps a single provider failure from becoming a system-wide outage.

What a Circuit Breaker Actually Does

A circuit breaker is a state machine with three states: Closed, Open, and Half-Open[^2].

In the Closed state, all requests pass through normally. The breaker counts failures in a sliding window. When the failure count crosses a threshold, the breaker trips to Open. In the Open state, all requests fail immediately — no network call is made. This is fast-fail: your system degrades gracefully instead of hammering a broken endpoint. After a timeout, the breaker enters Half-Open and allows a limited number of test requests. If they succeed, the breaker closes. If they fail, it opens again[^3].
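The three states can be sketched in a few lines of Python. This is a minimal illustration, not the Balacode harness: the class name, the default values, and the injected `clock` are assumptions, and for brevity it counts consecutive failures instead of keeping a true sliding window.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Minimal three-state breaker; all defaults are illustrative."""

    def __init__(self, failure_threshold=3, recovery_timeout=120.0,
                 half_open_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_successes = half_open_successes
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        """Return True if a call may proceed. Open fails fast with no network call."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN  # timeout elapsed: allow test requests
                self.successes = 0
                return True
            return False
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.half_open_successes:
                self.state = State.CLOSED
                self.failures = 0
        elif self.state is State.CLOSED:
            self.failures = 0  # consecutive-failure counting (simplification)

    def record_failure(self):
        if self.state is State.OPEN:
            return  # late results while open do not extend the open period
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed test request reopens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

Callers check `allow_request()` before each call and report the outcome with `record_success()` or `record_failure()`. Replacing the counter with a time-bucketed window is the natural next step.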

Michael Nygard introduced the pattern in Release It! (2007) for distributed systems. Martin Fowler formalized it for microservices in 2014[^4]. The pattern works because it prevents cascading failures: when a downstream service is unhealthy, upstream callers stop waiting and start failing fast.

But the microservices version assumes two things that do not hold for LLMs.

First, it assumes failures are binary: HTTP 5xx or timeout means unhealthy, HTTP 200 means healthy. LLMs return HTTP 200 with hallucinated output, partial refusals, or degraded quality. A generic circuit breaker sees the 200 and keeps the circuit closed. Your users see garbage.

Second, it assumes all services are homogeneous. A 5-second timeout might be correct for one REST API but catastrophic for another. LLMs have different baseline latencies, rate limits, and failure profiles. A single threshold for "all LLMs" is either too lax for fast models or too aggressive for slow ones[^5].

LLM Failure Modes That Break Generic Circuit Breakers

Soft Failures: HTTP 200, Bad Output

The most dangerous LLM failure is the one that looks like success. A model returns HTTP 200 with a response that is grammatically correct but factually wrong, off-topic, or partially refused. Generic circuit breakers do not inspect response content. They count HTTP codes. A soft failure slips through every time[^6].

Production harnesses need content-level health checks. At Balacode, we validate output against schema constraints, guardrail rules, and consistency checks before the circuit breaker counts the call as successful. A 200 with invalid output is a failure.
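As a rough sketch of what "a 200 with invalid output is a failure" can look like, the function below treats a response as healthy only when it parses as JSON, contains the required keys, and contains none of a list of refusal phrases. The function name, the JSON assumption, and the phrase list are all hypothetical; a real harness would plug in its own schema and guardrail checks.

```python
import json


def is_healthy_response(status_code, body, required_keys, banned_phrases=()):
    """Return True only if the call succeeded at both the HTTP and content level."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False  # not valid JSON: a soft failure despite the 200
    if not isinstance(payload, dict):
        return False
    if any(key not in payload for key in required_keys):
        return False  # schema constraint violated
    text = json.dumps(payload).lower()
    # Guardrail check: any banned phrase (for example a refusal) marks the call failed.
    return not any(phrase.lower() in text for phrase in banned_phrases)
```

The breaker then records a failure whenever this returns False, so soft failures count toward the threshold the same way a 5xx does.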

Rate Limits vs. Hard Failures

A 429 (rate limit) from OpenAI is not the same as a 500 (server error). The 429 means back off and retry. The 500 means the provider is unhealthy. A generic circuit breaker treats both as "failure" and trips the circuit. That is wrong. You want to retry 429s with exponential backoff, not open the circuit[^7].

Conversely, a 503 from Anthropic during a global outage means the circuit should open immediately. The breaker must classify errors by type, not just count them.
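One way to express that classification is a small function that maps an outcome to an action. The enum and function names are illustrative, and the handling of non-429 4xx codes (fail the request without touching the breaker) is our assumption; the post only specifies 429 and 5xx behavior.

```python
from enum import Enum


class Action(Enum):
    SUCCESS = "success"
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    COUNT_FAILURE = "count_failure"
    FAIL_REQUEST = "fail_request"


def classify(status_code, timed_out=False):
    """Map an HTTP outcome to what the harness should do next."""
    if timed_out:
        return Action.COUNT_FAILURE
    if status_code == 429:
        return Action.RETRY_WITH_BACKOFF  # rate limit: back off, do not trip
    if 500 <= status_code <= 599:
        return Action.COUNT_FAILURE       # provider unhealthy: counts toward the threshold
    if 400 <= status_code <= 499:
        return Action.FAIL_REQUEST        # assumption: caller error, breaker unaffected
    return Action.SUCCESS
```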

Latency Degradation Without Errors

LLM latency scales roughly linearly with output token count; it is not a binary healthy-or-unhealthy signal[^8]. Time-to-first-token (TTFT) ranges from ~100ms for Gemini Flash to 1–3 seconds for reasoning models. Per-token latency adds linearly on top. A fixed 5-second timeout would trip on a legitimate long-form generation but miss a model that normally responds in 500ms and is now taking 4 seconds[^9].

The circuit breaker needs a dynamic latency budget: expected latency as a function of prompt length, expected output tokens, and the model's baseline profile. For example, a 2000-token prompt to Claude might get an 8-second budget, while a 100-token ping gets 1 second. Degradation is measured against that budget, so it is detected before a fixed timeout would fire.

Per-Model Heterogeneity

GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro have different baselines. A threshold calibrated for GPT-4o will misfire on Claude. A threshold calibrated for Claude will miss degradation on Gemini[^10]. The circuit breaker must be per-model, not global.

Production Outage Case Studies

OpenAI: 15+ Incidents in 2024, February 2025 Outage

OpenAI experienced more than 15 incidents between January and September 2024, with rate limiting and API errors as the most common failure modes[^11]. On February 26, 2025, elevated API errors and search disruptions affected ChatGPT and API users globally[^12].

The community response was consistent: developers needed multi-provider fallback. "Waiting for status pages to turn green is not a strategy" became a recurring theme on OpenAI forums[^13]. Teams without fallback chains lost hours of production traffic. Teams with circuit breakers and fallback models degraded gracefully.

Claude: March and June 2026 Outages

Anthropic's Claude hit #1 on the App Store in early 2026. Two days later, on March 2, a global outage affected web access, authentication, and model endpoints[^14]. The incident repeated on June 2, with widespread disruptions to the web interface, developer console, and Claude Code platform[^15].

Deployflow, an infrastructure monitoring firm, published "Multi-LLM Redundancy" as the recommended fix within hours of the June incident[^16]. The recommendation was not "wait for Anthropic to fix it." It was "route around it." That routing requires a circuit breaker that can detect the outage, open the circuit, and trigger a fallback before your users notice.

What Went Wrong

In every case, the root cause was the same: single-provider dependency with no fast-fail mechanism. Status pages often lag reality by 5–10 minutes. By the time an outage is posted, user-impacting failures have already occurred. A circuit breaker with per-model health scoring detects degradation in seconds, not minutes.

Architecture: A Circuit Breaker Designed for LLMs

Generic circuit breakers treat all failures as network errors and all services as identical. An LLM-aware circuit breaker does neither. Here is the architecture we use in production harnesses at Balacode.

Per-Model Thresholds

Each model provider gets its own circuit breaker instance with independent configuration:

Parameter	GPT-4o	Claude 3.5 Sonnet	Gemini 1.5 Pro
Failure threshold	3 consecutive 5xx	3 consecutive 5xx	3 consecutive 5xx
Rate-limit tolerance	5 in 60s	5 in 60s	5 in 60s
Baseline TTFT	300ms	800ms	150ms
Per-token latency	15ms	25ms	10ms
Dynamic timeout	2× TTFT + (tokens × per-token)	Same formula	Same formula
Recovery window	120s	180s	120s
Half-open tests	2 synthetic prompts	2 synthetic prompts	2 synthetic prompts

These are illustrative baselines. Production thresholds must be calibrated against your own telemetry[^17].

The key insight: a rate limit on GPT-4o does not mean Claude is unhealthy. Each breaker operates independently. A failure on one model opens only that model's circuit. The others stay closed.
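The table and the independence rule can be sketched as one small breaker object per model. The values are the illustrative baselines from the table; the class, the dictionary keys (labels, not official API model IDs), and the simplified consecutive-failure logic are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class ModelBreaker:
    name: str
    failure_threshold: int      # consecutive 5xx responses before opening
    baseline_ttft_ms: int
    per_token_ms: int
    recovery_window_s: int
    consecutive_failures: int = 0
    is_open: bool = False

    def record(self, ok):
        """Record one call outcome; only this model's circuit is affected."""
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.is_open = True


# One independent instance per model, keyed by a hypothetical label.
BREAKERS = {
    b.name: b
    for b in (
        ModelBreaker("gpt-4o", 3, 300, 15, 120),
        ModelBreaker("claude-3.5-sonnet", 3, 800, 25, 180),
        ModelBreaker("gemini-1.5-pro", 3, 150, 10, 120),
    )
}
```

Three failed calls recorded against `BREAKERS["gpt-4o"]` open that circuit and leave the other two untouched.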

Cost-Aware Fallback Chains

When a circuit opens, the harness does not fall back to "any available model." It falls back through a ranked chain that considers cost-per-token, quality requirements, and latency budget.

Consider this chain: GPT-4o → Claude 3.5 Sonnet → Gemini 1.5 Pro.

If GPT-4o trips, the harness evaluates Claude. If Claude meets the latency budget and quality gate, the request routes there. But Claude costs roughly 3× more per token than GPT-4o for some workloads. The harness surfaces this cost delta to OpenTelemetry. Your SRE dashboard shows not just "fallback occurred" but "fallback cost increased by $0.0047 per request"[^18].

If the primary model is down and the fallback exceeds the cost ceiling, the harness can fast-fail with a clear error instead of silently burning budget. Cost is a first-class signal in the fallback decision.
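A sketch of that decision, assuming each candidate carries an estimated cost per request and an expected latency. The `Candidate` fields, the function name, and every number in the usage note are hypothetical, chosen only to illustrate the cost ceiling and the cost delta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_request_usd: float
    expected_latency_s: float
    circuit_open: bool


def pick_model(chain, cost_ceiling_usd, latency_budget_s):
    """Walk the ranked chain; return (candidate, cost delta vs. primary).

    Returns (None, None) when nothing qualifies, which the caller turns
    into a fast-fail with a clear error.
    """
    primary = chain[0]
    for candidate in chain:
        if candidate.circuit_open:
            continue  # this model's breaker is open
        if candidate.cost_per_request_usd > cost_ceiling_usd:
            continue  # would silently burn budget
        if candidate.expected_latency_s > latency_budget_s:
            continue  # would blow the latency budget
        delta = candidate.cost_per_request_usd - primary.cost_per_request_usd
        return candidate, delta
    return None, None
```

With a tripped primary at $0.0020 per request and a healthy first fallback at $0.0067, the function returns the fallback and a delta of about $0.0047, which is the value the harness would attach to its telemetry.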

Dynamic Latency Budgets

Static timeouts are wrong for LLMs. A 2000-token generation legitimately takes longer than a 50-token classification. The harness calculates expected latency per request:

expected_latency = (2 × baseline_ttft) + (expected_output_tokens × per_token_latency)

If the actual response time exceeds 2× the expected latency, the call is flagged as degraded. If degradation persists across the sliding window, the circuit opens. This catches slowdowns before they become timeouts[^19].
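The formula and the 2× degradation check translate directly into code. The sketch below uses seconds throughout; the `LatencyProfile` name and the `factor` parameter are our own labels, and the profile values come from the illustrative table, not from measured telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyProfile:
    baseline_ttft_s: float  # time-to-first-token baseline, in seconds
    per_token_s: float      # per-output-token latency, in seconds


def expected_latency(profile, expected_output_tokens):
    """expected_latency = (2 x baseline_ttft) + (expected_output_tokens x per_token_latency)."""
    return 2 * profile.baseline_ttft_s + expected_output_tokens * profile.per_token_s


def is_degraded(profile, expected_output_tokens, actual_s, factor=2.0):
    """Flag a call whose actual latency exceeds `factor` times the expected latency."""
    return actual_s > factor * expected_latency(profile, expected_output_tokens)
```

With the illustrative Claude 3.5 Sonnet baselines (800 ms TTFT, 25 ms per token), a request expecting 256 output tokens gets 2 × 0.8 + 256 × 0.025 = 8.0 seconds of expected latency, so a response that takes longer than 16 seconds is flagged as degraded.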

Retry Budgets with Exponential Backoff

Unconditional retries amplify failures. If a provider is rate-limiting, retrying immediately makes it worse. Production best practice is a retry budget: each run gets N retries total, with jittered exponential backoff[^20].

At Balacode, the harness implements retry budgets per model. A 429 triggers a retry with backoff. A 500 counts against the failure threshold. Once the retry budget is exhausted, the request fast-fails or triggers fallback. There is no infinite retry loop.
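A minimal retry budget with jittered exponential backoff might look like the following. The exception names and defaults are hypothetical, `sleep` and `rng` are injectable so the logic can be tested without waiting, and the jitter strategy used here is "full jitter" (a random fraction of the capped delay), which is one common choice, not necessarily the one the harness uses.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error a client raises on HTTP 429."""


class RetryBudgetExhausted(Exception):
    """Raised when the per-request retry budget is spent."""


def call_with_retry_budget(call, max_retries=3, base_delay=0.5, max_delay=8.0,
                           sleep=time.sleep, rng=random.random):
    """Retry `call` on RateLimited only, at most `max_retries` times."""
    attempt = 0
    while True:
        try:
            return call()
        except RateLimited:
            if attempt >= max_retries:
                raise RetryBudgetExhausted(f"gave up after {attempt} retries") from None
            # Full jitter: a random fraction of the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt) * rng()
            sleep(delay)
            attempt += 1
```

Any other exception, such as one representing a 500, propagates immediately so the caller can count it against the failure threshold. When `RetryBudgetExhausted` is raised, the caller fast-fails or moves to the fallback chain.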

Half-Open Testing with Synthetic Prompts

When a circuit enters half-open, the breaker needs to test whether the model has recovered. Sending a real user request is risky: if the model is still degraded, the user gets a bad experience.

The harness uses synthetic "ping" prompts with known-good expected outputs. Two successful synthetic tests close the circuit. One failure reopens it. The synthetic prompts are lightweight — 10–20 tokens — and do not count against user-facing latency budgets[^21].
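The probing step can be sketched as follows. The two prompts, their expected answers, and the `ask` callable (any function that sends a prompt to the model and returns its text) are placeholders; real probes should be chosen so that a healthy model answers them deterministically.

```python
# Hypothetical probes: (prompt, expected normalized answer).
PROBES = [
    ("Reply with the single word: pong", "pong"),
    ("What is 2 + 2? Reply with digits only.", "4"),
]


def probe_model(ask, probes=PROBES, required_successes=2):
    """Return True to close the circuit, False to reopen it."""
    successes = 0
    for prompt, expected in probes:
        try:
            answer = ask(prompt).strip().lower()
        except Exception:
            return False  # any error during a probe reopens the circuit
        if answer != expected:
            return False  # a wrong answer is a soft failure: reopen
        successes += 1
        if successes >= required_successes:
            return True
    return False  # not enough probes were available to reach the quota
```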

Circuit Breaker vs. Retry vs. Fallback

These three patterns are complementary, not interchangeable.

Retry handles transient failures. A 429 or a brief 503 recovers on the next attempt. Retry is the first line of defense. But retry without a budget is a denial-of-service attack on yourself.

Circuit breaker handles persistent failures. When a model is genuinely unhealthy — sustained 5xx errors, prolonged latency degradation — the breaker stops sending traffic and fails fast. This protects the provider from overload and protects your users from hanging requests.

Fallback handles the case where the primary model is unavailable but another model can serve the request. Fallback is what happens after the circuit opens. It is not a substitute for the breaker; it is the consequence of it.

The layered resilience stack looks like this:

Request arrives → validate input, check guardrails.
Route to primary model → attempt with retry budget and exponential backoff.
If retries are exhausted → check circuit state. If closed, count the failure. If the threshold is crossed, open the circuit.
If circuit is open → evaluate fallback chain. Route to next model if cost and latency budgets allow.
If no fallback available → fast-fail with structured error. Log to telemetry.

No single layer handles every failure mode. The circuit breaker prevents cascading degradation. The retry handles noise. The fallback preserves availability.
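The stack above can be compressed into a single request handler. This is a simplified sketch under several assumptions: each `call` in the chain is expected to apply its own retry budget and raise the hypothetical `ProviderError` once it is spent, breaker state is a plain dictionary, cost and latency checks are omitted, and a failed request falls through to the next model even before the circuit has opened so that the user still gets a response.

```python
class ProviderError(Exception):
    """Hypothetical error raised by a model client after its retries are exhausted."""


class NoModelAvailable(Exception):
    """Structured fast-fail: every model failed or had an open circuit."""


def handle(prompt, chain, breakers, failure_threshold=3):
    """chain: ordered (name, call) pairs; breakers: name -> {"failures", "open"}."""
    if not prompt.strip():
        raise ValueError("empty prompt")              # step 1: validate input
    for name, call in chain:                          # steps 2 and 4: primary, then fallbacks
        state = breakers.setdefault(name, {"failures": 0, "open": False})
        if state["open"]:
            continue                                  # open circuit: skip with no network call
        try:
            result = call(prompt)                     # assumed to retry internally
        except ProviderError:
            state["failures"] += 1                    # step 3: count the failure
            if state["failures"] >= failure_threshold:
                state["open"] = True                  # threshold crossed: open the circuit
            continue
        state["failures"] = 0
        return name, result
    raise NoModelAvailable("all models failed or circuits open")  # step 5
```

Recovery (moving an open circuit to half-open after the recovery window) is left out here to keep the control flow visible.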

Observability: What to Measure

A circuit breaker without observability is a black box. You need to know when it trips, why it trips, and what it costs.

OpenTelemetry Metrics

At Balacode, the harness exports standard OpenTelemetry metrics for every circuit breaker instance:

circuit_breaker.state — gauge: 0 = closed, 1 = half-open, 2 = open
circuit_breaker.failure_rate — ratio of failed to total calls in the sliding window
circuit_breaker.fallback.count — total fallback events per model pair
circuit_breaker.fallback.cost_delta — cost difference between primary and fallback model
circuit_breaker.latency.p99 — 99th percentile latency per model
circuit_breaker.retry.count — total retries per model

These metrics feed into standard SRE dashboards: Grafana, Datadog, New Relic. No custom instrumentation required[^22].
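To make the metric definitions concrete, here is a library-free sketch that computes the values behind most of those names for one breaker instance. It deliberately does not call the OpenTelemetry SDK: the class, the count-based window (last N calls instead of a time window), and the nearest-rank p99 are assumptions, and exporting the snapshot through an OpenTelemetry meter is left to the reader. The cost delta metric is omitted because it depends on pricing data.

```python
from collections import deque

# Gauge encoding from the metric list: 0 = closed, 1 = half-open, 2 = open.
STATE_VALUES = {"closed": 0, "half_open": 1, "open": 2}


class BreakerMetrics:
    """Collects per-model values for the metrics listed above (sketch only)."""

    def __init__(self, window_size=50):
        self.failed = deque(maxlen=window_size)     # True = failed call
        self.latencies = deque(maxlen=window_size)  # seconds
        self.fallback_count = 0
        self.retry_count = 0

    def observe(self, failed, latency_s):
        self.failed.append(bool(failed))
        self.latencies.append(latency_s)

    def snapshot(self, state):
        n = len(self.failed)
        failure_rate = sum(self.failed) / n if n else 0.0
        if n:
            rank = (99 * n + 99) // 100  # nearest-rank p99: ceil(0.99 * n)
            p99 = sorted(self.latencies)[rank - 1]
        else:
            p99 = 0.0
        return {
            "circuit_breaker.state": STATE_VALUES[state],
            "circuit_breaker.failure_rate": failure_rate,
            "circuit_breaker.fallback.count": self.fallback_count,
            "circuit_breaker.latency.p99": p99,
            "circuit_breaker.retry.count": self.retry_count,
        }
```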

Key Dashboards

Circuit Health Overview:
- Open circuit count per model
- Time-to-recovery (Open → Closed) per model
- Fallback rate by hour

Cost Impact:
- Daily spend delta from fallback events
- Cost per 1,000 requests by primary vs. fallback model
- Budget burn rate during outages

Latency Trends:
- TTFT p50/p99 per model
- Degradation events (actual > 2× expected latency)
- Half-open test success rate

Alerting Rules

We recommend three alert thresholds:

Warning: Fallback rate > 5% for any model for 10 minutes. Indicates degradation, not outage.
Critical: Circuit open for > 5 minutes on any model. Indicates sustained failure.
Emergency: Fallback cost delta > $0.01 per request for > 15 minutes. Indicates expensive fallback is burning budget.
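Expressed as code, the three rules reduce to simple comparisons. In practice they would live in your alerting tool's own rule language; the function and parameter names below are illustrative, and each duration argument is assumed to be how long the condition has held continuously.

```python
def evaluate_alerts(fallback_rate, fallback_rate_minutes,
                    circuit_open_minutes,
                    cost_delta_usd, cost_delta_minutes):
    """Return the alert levels that currently fire for one model."""
    alerts = []
    if fallback_rate > 0.05 and fallback_rate_minutes >= 10:
        alerts.append("warning")    # degradation, not outage
    if circuit_open_minutes > 5:
        alerts.append("critical")   # sustained failure
    if cost_delta_usd > 0.01 and cost_delta_minutes > 15:
        alerts.append("emergency")  # expensive fallback is burning budget
    return alerts
```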

What You Can Use Today

If you are running LLMs in production, here is the minimum viable resilience stack:

Separate 429 from 5xx. Rate limits need backoff, not circuit trips.
Set per-model timeouts. Do not use a global 30-second timeout. Calculate expected latency from prompt size and model baseline.
Implement a retry budget. Cap retries per request. Use jittered exponential backoff.
Add a fallback model. Even a cheaper, lower-quality model is better than no response.
Export circuit breaker metrics. OpenTelemetry is the standard. Use it.

These five changes will move you from "hope-based reliability" to "engineered resilience."

Want the Full System Design?

This post covers the principles. The implementation — per-model configuration schemas, fallback chain DSL, OpenTelemetry metric definitions, and Terraform modules for deployment — is documented in our architecture blueprint.


See also: our AI Harness Engineering service page for the production harness spec.