Circuit Breakers for LLMs: Architecture, Config, and Fallback Logic
Your AI agent works perfectly in staging. Then production traffic hits, OpenAI returns a 429, Claude goes dark for two hours, and your users see empty responses. You did not have a circuit breaker. You had hope.
This post explains what a circuit breaker is, why the generic pattern fails for LLMs, and how to build one that understands model-specific failure modes, cost-aware fallback chains, and dynamic latency budgets. Every pattern here is something we ship in production harnesses at Balacode.
The Production AI Resilience Gap
Only 12% of AI agent projects reach production. The leading cause of death is not model quality. It is the absence of a production-grade harness: circuit breakers, retry discipline, and observability[^1]. Models are getting smarter, but they are not getting more reliable to call. GPT-5 does not fix rate limits. Claude 4 does not prevent provider outages.
The gap is structural. Most teams build agents as direct API calls to a single provider. When that provider fails, the agent fails. There is no fast-fail path, no fallback model, no telemetry to tell you what broke. The harness — the layer between your application logic and the raw LLM API — is the missing piece.
At Balacode, we treat the harness as first-class infrastructure. Circuit breakers are not an afterthought. They are the guardrail that keeps a single provider failure from becoming a system-wide outage.
What a Circuit Breaker Actually Does
A circuit breaker is a state machine with three states: Closed, Open, and Half-Open[^2].
In the Closed state, all requests pass through normally. The breaker counts failures in a sliding window. When the failure count crosses a threshold, the breaker trips to Open. In the Open state, all requests fail immediately — no network call is made. This is fast-fail: your system degrades gracefully instead of hammering a broken endpoint. After a timeout, the breaker enters Half-Open and allows a limited number of test requests. If they succeed, the breaker closes. If they fail, it opens again[^3].
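The three states and their transitions can be sketched in a few dozen lines. This is a minimal illustration, not a production breaker: the class name, parameter names, and default values are assumptions chosen for the example, and the Half-Open state here admits every request rather than a capped number of probes.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Minimal three-state breaker. All defaults are illustrative assumptions."""

    def __init__(self, failure_threshold=3, window_s=60.0, open_timeout_s=120.0,
                 half_open_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.open_timeout_s = open_timeout_s
        self.half_open_successes = half_open_successes
        self.clock = clock
        self.state = State.CLOSED
        self._failures = []          # timestamps of failures inside the window
        self._opened_at = 0.0
        self._probe_successes = 0

    def allow_request(self):
        """Return True if a call may go out. Moves Open -> Half-Open after the timeout."""
        if self.state is State.OPEN:
            if self.clock() - self._opened_at >= self.open_timeout_s:
                self.state = State.HALF_OPEN
                self._probe_successes = 0
                return True
            return False             # fast-fail: no network call is made
        return True

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self._probe_successes += 1
            if self._probe_successes >= self.half_open_successes:
                self.state = State.CLOSED
                self._failures.clear()

    def record_failure(self):
        now = self.clock()
        if self.state is State.HALF_OPEN:
            self._trip(now)          # a failed probe reopens the circuit
            return
        # Drop failures that have aged out of the sliding window.
        self._failures = [t for t in self._failures if now - t < self.window_s]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = State.OPEN
        self._opened_at = now
```

Callers check `allow_request()` before each call and report the outcome with `record_success()` or `record_failure()`. Injecting `clock` keeps the timing logic testable without sleeping.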
Michael Nygard introduced the pattern in Release It! (2007) for distributed systems. Martin Fowler formalized it for microservices in 2014[^4]. The pattern works because it prevents cascading failures: when a downstream service is unhealthy, upstream callers stop waiting and start failing fast.
But the microservices version assumes two things that do not hold for LLMs.
First, it assumes failures are binary: HTTP 5xx or timeout means unhealthy, HTTP 200 means healthy. LLMs return HTTP 200 with hallucinated output, partial refusals, or degraded quality. A generic circuit breaker sees the 200 and keeps the circuit closed. Your users see garbage.
Second, it assumes all services are homogeneous. A 5-second timeout might be correct for one REST API but catastrophic for another. LLMs have different baseline latencies, rate limits, and failure profiles. A single threshold for "all LLMs" is either too lax for fast models or too aggressive for slow ones[^5].
LLM Failure Modes That Break Generic Circuit Breakers
Soft Failures: HTTP 200, Bad Output
The most dangerous LLM failure is the one that looks like success. A model returns HTTP 200 with a response that is grammatically correct but factually wrong, off-topic, or partially refused. Generic circuit breakers do not inspect response content. They count HTTP codes. A soft failure slips through every time[^6].
Production harnesses need content-level health checks. At Balacode, we validate output against schema constraints, guardrail rules, and consistency checks before the circuit breaker counts the call as successful. A 200 with invalid output is a failure.
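A content-level health check can be as small as one function that runs before the breaker records a success. The sketch below assumes the model was asked for a JSON object; the `required_fields` schema and the field name `answer` are hypothetical, and real guardrail and consistency checks would be layered on top.

```python
import json


def is_healthy_response(status_code, body, required_fields=("answer",)):
    """Count a call as successful only if the HTTP status AND the content pass.

    `required_fields` is a hypothetical schema: the body must be a JSON object
    containing these keys with non-empty string values.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False                 # not JSON at all: soft failure
    if not isinstance(payload, dict):
        return False
    for field in required_fields:
        value = payload.get(field)
        if not isinstance(value, str) or not value.strip():
            return False             # missing or empty field: soft failure
    return True
```

The breaker then records a failure whenever this returns `False`, even though the transport layer reported a 200.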
Rate Limits vs. Hard Failures
A 429 (rate limit) from OpenAI is not the same as a 500 (server error). The 429 means back off and retry. The 500 means the provider is unhealthy. A generic circuit breaker treats both as "failure" and trips the circuit. That is wrong. You want to retry 429s with exponential backoff, not open the circuit[^7].
Conversely, a 503 from Anthropic during a global outage means the circuit should open immediately. The breaker must classify errors by type, not just count them.
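Classification by error type might look like the following sketch. The `Action` names are assumptions made for this example, as is the choice to treat a missing status (a timeout or connection error) as a provider failure and other 4xx codes as caller errors that should not affect provider health.

```python
from enum import Enum


class Action(Enum):
    SUCCESS = "success"
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    COUNT_FAILURE = "count_failure"
    CLIENT_ERROR = "client_error"


def classify(status_code):
    """Map an HTTP status (or None for a timeout) to what the harness should do."""
    if status_code is None:
        return Action.COUNT_FAILURE          # assumption: timeouts count as failures
    if status_code == 429:
        return Action.RETRY_WITH_BACKOFF     # rate limit: back off, do not trip
    if 500 <= status_code <= 599:
        return Action.COUNT_FAILURE          # provider unhealthy: feeds the breaker
    if 200 <= status_code <= 299:
        return Action.SUCCESS                # still subject to content validation
    return Action.CLIENT_ERROR               # e.g. 400/401: our bug, not an outage
```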
Latency Degradation Without Errors
LLM latency scales roughly linearly with output token count, so a slow response is not automatically an unhealthy one[^8]. Time-to-first-token (TTFT) ranges from ~100ms for Gemini Flash to 1–3 seconds for reasoning models. Per-token latency adds linearly on top. A fixed 5-second timeout would trip on a legitimate long-form generation but miss a model that normally responds in 500ms and is now taking 4 seconds[^9].
The circuit breaker needs a dynamic latency budget: expected latency as a function of prompt length, expected output tokens, and the model's baseline profile. A 2000-token prompt to Claude gets an 8-second budget. A 100-token ping gets 1 second. Degradation is detected before the timeout fires.
Per-Model Heterogeneity
GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro have different baselines. A threshold calibrated for GPT-4o will misfire on Grok. A threshold calibrated for Claude will miss degradation on Gemini[^10]. The circuit breaker must be per-model, not global.
Production Outage Case Studies
OpenAI: 15+ Incidents in 2024, February 2025 Outage
OpenAI experienced more than 15 incidents between January and September 2024, with rate limiting and API errors as the most common failure modes[^11]. On February 26, 2025, elevated API errors and search disruptions affected ChatGPT and API users globally[^12].
The community response was consistent: developers needed multi-provider fallback. "Waiting for status pages to turn green is not a strategy" became a recurring theme on OpenAI forums[^13]. Teams without fallback chains lost hours of production traffic. Teams with circuit breakers and fallback models degraded gracefully.
Claude: March and June 2026 Outages
Anthropic's Claude hit #1 on the App Store in early 2026. Two days later, on March 2, a global outage affected web access, authentication, and model endpoints[^14]. The incident repeated on June 2, with widespread disruptions to the web interface, developer console, and Claude Code platform[^15].
Deployflow, an infrastructure monitoring firm, published "Multi-LLM Redundancy" as the recommended fix within hours of the June incident[^16]. The recommendation was not "wait for Anthropic to fix it." It was "route around it." That routing requires a circuit breaker that can detect the outage, open the circuit, and trigger a fallback before your users notice.
What Went Wrong
In every case, the root cause was the same: single-provider dependency with no fast-fail mechanism. Status pages lag reality by 5–10 minutes. By the time an outage is posted, user-impacting failures have already occurred. A circuit breaker with per-model health scoring detects degradation in seconds, not minutes.
Architecture: A Circuit Breaker Designed for LLMs
Generic circuit breakers treat all failures as network errors and all services as identical. An LLM-aware circuit breaker does neither. Here is the architecture we use in production harnesses at Balacode.
Per-Model Thresholds
Each model provider gets its own circuit breaker instance with independent configuration:
| Parameter | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro |
|---|---|---|---|
| Failure threshold | 3 consecutive 5xx | 3 consecutive 5xx | 3 consecutive 5xx |
| Rate-limit tolerance | 5 in 60s | 5 in 60s | 5 in 60s |
| Baseline TTFT | 300ms | 800ms | 150ms |
| Per-token latency | 15ms | 25ms | 10ms |
| Dynamic timeout | 2× TTFT + (tokens × per-token) | Same formula | Same formula |
| Recovery window | 120s | 180s | 120s |
| Half-open tests | 2 synthetic prompts | 2 synthetic prompts | 2 synthetic prompts |
These are illustrative baselines. Production thresholds must be calibrated against your own telemetry[^17].
The key insight: a rate limit on GPT-4o does not mean Claude is unhealthy. Each breaker operates independently. A failure on one model opens only that model's circuit. The others stay closed.
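One way to express the table as configuration is a frozen dataclass per model plus one health tracker per model. The values below are copied from the illustrative table; the dictionary keys, class names, and field names are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    failure_threshold: int       # consecutive 5xx before the circuit opens
    rate_limit_tolerance: int    # 429s tolerated per 60 s
    baseline_ttft_ms: int
    per_token_ms: int
    recovery_window_s: int
    half_open_tests: int


# Illustrative baselines from the table above; calibrate against real telemetry.
PROFILES = {
    "gpt-4o": ModelProfile(3, 5, 300, 15, 120, 2),
    "claude-3.5-sonnet": ModelProfile(3, 5, 800, 25, 180, 2),
    "gemini-1.5-pro": ModelProfile(3, 5, 150, 10, 120, 2),
}


class ModelHealth:
    """Tracks consecutive 5xx responses for a single model."""

    def __init__(self, profile):
        self.profile = profile
        self.consecutive_5xx = 0
        self.is_open = False

    def record(self, ok):
        if ok:
            self.consecutive_5xx = 0
            return
        self.consecutive_5xx += 1
        if self.consecutive_5xx >= self.profile.failure_threshold:
            self.is_open = True


# One independent tracker per model: a trip on one never touches the others.
health = {name: ModelHealth(profile) for name, profile in PROFILES.items()}
```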
Cost-Aware Fallback Chains
When a circuit opens, the harness does not fall back to "any available model." It falls back through a ranked chain that considers cost-per-token, quality requirements, and latency budget.
Consider this chain: GPT-4o → Claude 3.5 Sonnet → Gemini 1.5 Pro.
If GPT-4o trips, the harness evaluates Claude. If Claude meets the latency budget and quality gate, the request routes there. But Claude costs roughly 3× more per token than GPT-4o for some workloads. The harness surfaces this cost delta to OpenTelemetry. Your SRE dashboard shows not just "fallback occurred" but "fallback cost increased by $0.0047 per request"[^18].
If the primary model is down and the fallback exceeds the cost ceiling, the harness can fast-fail with a clear error instead of silently burning budget. Cost is a first-class signal in the fallback decision.
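A ranked chain with a cost ceiling and latency budget could be sketched as below. The `Candidate` fields and the per-request cost figures used in the usage note are made up for illustration, and the quality gate is omitted to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_request_usd: float
    expected_latency_ms: float
    circuit_open: bool


def choose_model(chain: Sequence[Candidate], cost_ceiling_usd: float,
                 latency_budget_ms: float) -> Optional[Candidate]:
    """Return the first model in ranked order that is healthy and within budget."""
    for candidate in chain:
        if candidate.circuit_open:
            continue
        if candidate.cost_per_request_usd > cost_ceiling_usd:
            continue
        if candidate.expected_latency_ms > latency_budget_ms:
            continue
        return candidate
    return None  # caller should fast-fail with a structured error


def cost_delta_usd(primary: Candidate, chosen: Candidate) -> float:
    """Extra cost per request of serving from `chosen` instead of `primary`."""
    return chosen.cost_per_request_usd - primary.cost_per_request_usd
```

For example, with GPT-4o open and hypothetical costs of $0.002, $0.006, and $0.0015 per request for the three models, a $0.01 ceiling routes to Claude, a $0.005 ceiling skips Claude and routes to Gemini, and a $0.001 ceiling returns `None`.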
Dynamic Latency Budgets
Static timeouts are wrong for LLMs. A 2000-token generation legitimately takes longer than a 50-token classification. The harness calculates expected latency per request:
expected_latency = (2 × baseline_ttft) + (expected_output_tokens × per_token_latency)
If the actual response time exceeds 2× the expected latency, the call is flagged as degraded. If degradation persists across the sliding window, the circuit opens. This catches slowdowns before they become timeouts[^19].
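The formula and the 2× degradation check translate directly into code. The function and parameter names are assumptions for this sketch; the numbers in the usage note reuse the illustrative GPT-4o baselines (300ms TTFT, 15ms per token).

```python
def expected_latency_ms(baseline_ttft_ms, per_token_ms, expected_output_tokens):
    """(2 x baseline TTFT) + (expected output tokens x per-token latency)."""
    return 2 * baseline_ttft_ms + expected_output_tokens * per_token_ms


def is_degraded(actual_ms, expected_ms, factor=2.0):
    """Flag a call whose latency exceeds `factor` times the expected latency."""
    return actual_ms > factor * expected_ms
```

For a 200-token generation on the illustrative GPT-4o profile, the expected latency is 2 × 300 + 200 × 15 = 3600ms, so a call is flagged as degraded only above 7200ms. A 20-token classification on the same profile is expected in 900ms and flagged above 1800ms, which a fixed 5-second timeout would never notice.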
Retry Budgets with Exponential Backoff
Unconditional retries amplify failures. If a provider is rate-limiting, retrying immediately makes it worse. Production best practice is a retry budget: each run gets N retries total, with jittered exponential backoff[^20].
At Balacode, the harness implements retry budgets per model. A 429 triggers a retry with backoff. A 500 counts against the failure threshold. Once the retry budget is exhausted, the request fast-fails or triggers fallback. There is no infinite retry loop.
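A retry budget with jittered exponential backoff can be sketched as follows. This is not the Balacode implementation: the function names, the "full jitter" strategy, and the default base and cap are assumptions, and `call` is a hypothetical callable that returns an HTTP-like status code.

```python
import random
import time


class RetryBudgetExhausted(Exception):
    """Raised when every attempt in the budget was rate-limited."""


def backoff_delay(attempt, base_s=0.5, cap_s=30.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def call_with_retry_budget(call, max_retries=3, sleep=time.sleep, rng=random.random):
    """Retry only on 429, at most `max_retries` times, then give up.

    Any non-429 status is returned immediately so the caller can decide
    whether it counts against the circuit breaker.
    """
    for attempt in range(max_retries + 1):
        status = call()
        if status != 429:
            return status
        if attempt < max_retries:
            sleep(backoff_delay(attempt, rng=rng))
    raise RetryBudgetExhausted(f"still rate-limited after {max_retries} retries")
```

Injecting `sleep` and `rng` keeps the loop deterministic under test. On `RetryBudgetExhausted`, the harness would fast-fail or hand the request to the fallback chain.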
Half-Open Testing with Synthetic Prompts
When a circuit enters half-open, the breaker needs to test whether the model has recovered. Sending a real user request is risky: if the model is still degraded, the user gets a bad experience.
The harness uses synthetic "ping" prompts with known-good expected outputs. Two successful synthetic tests close the circuit. One failure reopens it. The synthetic prompts are lightweight — 10–20 tokens — and do not count against user-facing latency budgets[^21].
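A half-open probe routine might look like this sketch. The probe prompts, the substring match, and the `send_prompt` callable (which takes a prompt and returns the model's text) are all hypothetical stand-ins for whatever client and known-good checks a harness actually uses.

```python
# Hypothetical lightweight probes: (prompt, substring expected in the reply).
PROBES = [
    ("Reply with the single word: pong", "pong"),
    ("What is 2 + 2? Answer with one number.", "4"),
]


def probe_recovered(send_prompt, probes=PROBES, required_successes=2):
    """Return True once `required_successes` probes pass; any failure returns False."""
    successes = 0
    for prompt, expected in probes:
        try:
            reply = send_prompt(prompt)
        except Exception:
            return False                     # transport error: reopen the circuit
        if expected.lower() not in reply.lower():
            return False                     # wrong content: reopen the circuit
        successes += 1
        if successes >= required_successes:
            return True
    return False                             # not enough probes to decide
```

The breaker closes the circuit when this returns `True` and reopens it otherwise, so no real user request is spent on testing recovery.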
Circuit Breaker vs. Retry vs. Fallback
These three patterns are complementary, not interchangeable.
Retry handles transient failures. A 429 or a brief 503 recovers on the next attempt. Retry is the first line of defense. But retry without a budget is a denial-of-service attack on yourself.
Circuit breaker handles persistent failures. When a model is genuinely unhealthy — sustained 5xx errors, prolonged latency degradation — the breaker stops sending traffic and fails fast. This protects the provider from overload and protects your users from hanging requests.
Fallback handles the case where the primary model is unavailable but another model can serve the request. Fallback is what happens after the circuit opens. It is not a substitute for the breaker; it is the consequence of it.
The layered resilience stack looks like this:
- Request arrives → validate input, check guardrails.
- Route to primary model → attempt with retry budget and exponential backoff.
- If retries exhaust → check circuit state. If closed, count failure. If threshold crossed, open circuit.
- If circuit is open → evaluate fallback chain. Route to next model if cost and latency budgets allow.
- If no fallback available → fast-fail with structured error. Log to telemetry.
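The steps above can be condensed into one request handler. This is a simplified sketch under several assumptions: `call_model`, `is_open`, and `record_failure` are hypothetical callables supplied by the surrounding harness, backoff sleeps and cost checks are left out, and an in-flight request moves on to the next model after a failure whether or not that failure was the one that tripped the breaker.

```python
def handle_request(prompt, chain, call_model, is_open, record_failure, max_retries=2):
    """Walk the ranked `chain` of model names until one returns a 200.

    call_model(name, prompt) -> (status_code, text)
    is_open(name)            -> True if that model's circuit is open
    record_failure(name)     -> lets the breaker count the failure
    """
    for model in chain:
        if is_open(model):
            continue                         # fast-fail this model, try the next
        for _attempt in range(max_retries + 1):
            status, text = call_model(model, prompt)
            if status == 200:
                return {"ok": True, "model": model, "text": text}
            if status != 429:
                break                        # hard failure: retrying will not help
        record_failure(model)                # retries exhausted or hard failure
    # No healthy model could serve the request: structured fast-fail.
    return {"ok": False, "error": "no_model_available"}
```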
No single layer handles every failure mode. The circuit breaker prevents cascading degradation. The retry handles noise. The fallback preserves availability.
Observability: What to Measure
A circuit breaker without observability is a black box. You need to know when it trips, why it trips, and what it costs.
OpenTelemetry Metrics
At Balacode, the harness exports standard OpenTelemetry metrics for every circuit breaker instance:
- `circuit_breaker.state`: gauge, 0 = closed, 1 = half-open, 2 = open
- `circuit_breaker.failure_rate`: ratio of failed to total calls in the sliding window
- `circuit_breaker.fallback.count`: total fallback events per model pair
- `circuit_breaker.fallback.cost_delta`: cost difference between primary and fallback model
- `circuit_breaker.latency.p99`: 99th percentile latency per model
- `circuit_breaker.retry.count`: total retries per model
These metrics feed into standard SRE dashboards: Grafana, Datadog, New Relic. No custom instrumentation required[^22].
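To make two of these metrics concrete, the sketch below computes the state gauge and the sliding-window failure rate using only the standard library. The class and function names are assumptions, and the export step through an OpenTelemetry SDK is deliberately left out; the snapshot is just the values an exporter would report.

```python
from collections import deque

# Gauge encoding for circuit_breaker.state.
STATE_GAUGE = {"closed": 0, "half_open": 1, "open": 2}


class WindowStats:
    """Keeps call outcomes for the last `window_s` seconds."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._calls = deque()                # (timestamp, succeeded), oldest first

    def record(self, timestamp, succeeded):
        self._calls.append((timestamp, succeeded))

    def failure_rate(self, now):
        # Evict calls that have aged out of the window.
        while self._calls and now - self._calls[0][0] >= self.window_s:
            self._calls.popleft()
        if not self._calls:
            return 0.0
        failed = sum(1 for _, ok in self._calls if not ok)
        return failed / len(self._calls)


def snapshot(state, stats, now):
    """Values for one breaker instance, keyed by metric name."""
    return {
        "circuit_breaker.state": STATE_GAUGE[state],
        "circuit_breaker.failure_rate": stats.failure_rate(now),
    }
```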
Key Dashboards
Circuit Health Overview:
- Open circuit count per model
- Time-to-recovery (Open → Closed) per model
- Fallback rate by hour
Cost Impact:
- Daily spend delta from fallback events
- Cost per 1,000 requests by primary vs. fallback model
- Budget burn rate during outages
Latency Trends:
- TTFT p50/p99 per model
- Degradation events (actual > 2× expected latency)
- Half-open test success rate
Alerting Rules
We recommend three alert thresholds:
- Warning: Fallback rate > 5% for any model for 10 minutes. Indicates degradation, not outage.
- Critical: Circuit open for > 5 minutes on any model. Indicates sustained failure.
- Emergency: Fallback cost delta > $0.01 per request for > 15 minutes. Indicates expensive fallback is burning budget.
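These rules can be prototyped before they are translated into a dashboard's own alerting language. The sketch assumes one sample per minute per model, oldest first, and treats "for N minutes" as "the last N samples all breach the threshold"; the function names are hypothetical.

```python
def sustained(samples, threshold, minutes):
    """True if the most recent `minutes` per-minute samples all exceed `threshold`."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(value > threshold for value in recent)


def evaluate_alerts(fallback_rate, circuit_open, cost_delta_usd):
    """Each argument is a list of per-minute samples for one model.

    fallback_rate:  fraction of requests served by a fallback model
    circuit_open:   1 if the circuit was open during that minute, else 0
    cost_delta_usd: extra cost per request caused by fallback
    """
    alerts = []
    if sustained(fallback_rate, 0.05, 10):
        alerts.append("warning")
    if sustained(circuit_open, 0, 5):
        alerts.append("critical")
    if sustained(cost_delta_usd, 0.01, 15):
        alerts.append("emergency")
    return alerts
```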
What You Can Use Today
If you are running LLMs in production, here is the minimum viable resilience stack:
- Separate 429 from 5xx. Rate limits need backoff, not circuit trips.
- Set per-model timeouts. Do not use a global 30-second timeout. Calculate expected latency from prompt size and model baseline.
- Implement a retry budget. Cap retries per request. Use jittered exponential backoff.
- Add a fallback model. Even a cheaper, lower-quality model is better than no response.
- Export circuit breaker metrics. OpenTelemetry is the standard. Use it.
These five changes will move you from "hope-based reliability" to "engineered resilience."
Want the Full System Design?
This post covers the principles. The implementation — per-model configuration schemas, fallback chain DSL, OpenTelemetry metric definitions, and Terraform modules for deployment — is documented in our architecture blueprint.