Blog

← Back to blog

LLM Session Context Management: A Design Pattern

Design pattern for LLM session context management: token budgets, truncation guardrails, server-side compaction, and audit-ready observability.

LLM Session Context Management: A Design Pattern


Cross-section of an LLM context window: a fixed copper spine of system instructions and tool definitions, with rolling message history and cleared tool results on either side.

Hero — a single LLM request envelope, cross-sectioned: the static prefix (system + tools) on the left, the rolling message history in the middle, the cleared tool results and stripped thinking blocks on the right.

The unglamorous layer that decides whether a multi-turn product survives its tenth turn or collapses under its own context is llm session context management. The most common shape in 2026 is a messages=[...] array that grows without a rule until the provider silently truncates the tail, the model starts to "context rot," and a 2 a.m. page follows. The fix is not a bigger model — it is a design pattern: a deterministic, observable, budgeted way to keep the context window inside its own rules. This post is that pattern: the context window budget math, the llm truncation strategy decision tree, the session state design that makes truncation auditable, the gateway-level token budget enforcement that makes it enforceable, and the conversation memory design boundary we draw around vendor features we have not shipped.

The pattern sits inside AI Harness Engineering and is the load-bearing sibling of Circuit Breakers for LLMs and Prompt Injection Defense. It borrows its static-prefix source-of-truth from the Schema Registry Pattern. Everything here is either shipped by a vendor we use, or an architecture design we ship at Balacode. Nothing here claims agent memory, reflection loops, or multi-agent coordination.

§1. Problem

A multi-turn session accumulates five inputs, and only the first two are stable: the system prompt, the tool definitions, the rolling message history, the per-turn tool results, and (for reasoning models) the thinking blocks. Without an explicit rule, the context window grows monotonically.

First, the model starts to "context rot." Anthropic's September 29, 2025 post formalised the term and named an attention budget depleted by every token, not just relevant ones. [1] Liu et al.'s Lost in the Middle is the empirical underpinning: "performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts." [2]

Second, the request exceeds the model window. Anthropic documents 200K and 1M-token limits per Claude family member, and OpenAI's /responses/compact endpoint exists because unmanaged sessions blow past the window. [3][4] Provider-side hard truncation is the worst default: the caller cannot tell which turn was cut, the OTel span carries no gen_ai.usage.* counter for the dropped tokens, and the cache prefix silently invalidates.

Third, cost becomes un-explainable. A request mixing 4K cached system-prompt tokens, 80K of stale tool results, and 16K of fresh user input bills differently at every vendor. Without gen_ai.conversation.id on every span, the on-call cannot answer: which turn ran over budget?

§2. Forces

Four forces shape the pattern.

  • Determinism vs. summarisation cost. Server-side compaction (Anthropic) and /responses/compact (OpenAI) are managed, observable, and billable. Client-side auto-summarization is none of those. The pattern must choose: predictable cost and an audit trail, or a silent LLM call that drops state.
  • Static-prefix stability vs. dynamic-tail churn. A long system prompt plus a tool registry is the cheap part when it is cached. OpenAI: "Cache hits are only possible for exact prefix matches… place static content like instructions and examples at the beginning of your prompt." [5] If the llm truncation strategy touches that prefix, the cache invalidates and the bill goes up.
  • Observability vs. instrumentation cost. Without gen_ai.conversation.id on every span, post-incident analysis is a guess. With OpenTelemetry GenAI semconv (≥ 1.42) [6][7] it is one trace query — the cost is one decorator.
  • Hard-cut vs. refuse-vs-compact. Three deterministic responses to overflow: hard-cut oldest turns, refuse the new turn with a typed error, or invoke managed compaction. The pattern must declare which it does, in which order, and log the choice.

§3. Solution

Seven steps, run in order, every turn, each emitting a span so gen_ai.conversation.id is the join key for every on-call question.

§3.1. Treat the request envelope as three named buckets

A single LLM request is not a flat messages=[...] array. It is five named inputs with different lifetimes, and the session state design has to give each bucket its own rule.

Annotated breakdown of a single LLM request envelope: system prompt + tool definitions in the cached prefix, message history rolling, tool results cleared oldest-first, thinking blocks stripped.

Figure 1: A single LLM request envelope. Static prefix (system + tool definitions) sits left in the cached, cold-cost column. Message history rolls in the middle. Tool results are cleared oldest-first on the right. Thinking blocks are stripped before the request reaches the model.

The static prefix is the cached part: system instructions, tool definitions, schema registry fragments, prompt-cache key. Anthropic, OpenAI, and Google all treat this region as a separately-priced object. [3][5][8] Truncation rules must address each region by name, never the array as a whole.

§3.2. Cache the static prefix

Use the vendor's prompt-cache feature with cache_read.input_tokens on the OTel GenAI span. OpenAI exposes prompt_cache_retention for the 24-hour extended cache [5]; Anthropic exposes cache_read_input_tokens in the same field; Google's implicit caching is enabled by default for Gemini 2.5+ with one-hour TTLs. [8] The token budget enforcement layer must price cached tokens at the cached rate, not the input rate.

§3.3. Apply server-side compaction as the default overflow handler

Anthropic documents "server-side compaction" as "the primary strategy for context management" for long-running conversations. [3] OpenAI pairs the Responses API with the /responses/compact endpoint. [4] Both are billable inference calls, both produce a typed event in the conversation log, and both keep gen_ai.conversation.id constant across the compacted request. They are the only llm truncation strategy that is both vendor-managed and observable.

§3.4. Add a fine-grained context-editing step when compaction is too coarse

When compaction is overkill — say, a single stale tool result on a long-running session — Anthropic's context-management-2025-06-27: beta header with the clear_tool_uses_20250919 strategy clears oldest tool results server-side, and clear_thinking_20251015 strips thinking blocks, before the prompt reaches the model. [9] Cleared content is replaced with placeholder text, so the model knows it was removed rather than hallucinating a missing call.

§3.5. Propagate a session.id on every span

OpenTelemetry's gen_ai.conversation.id attribute is the single most under-used field in production LLM observability. [6] Helicone's Helicone-Session-Id and Helicone-Session-Path headers do the same thing at the gateway layer for teams not yet on OTel. [10] The session-id is the join key for every gen_ai.usage.* counter, truncation decision, and compaction event. Without it, the conversation memory design is invisible.

§3.6. Enforce the token budget at the gateway

The gateway is the right place to enforce the context window budget, not the application code. LiteLLM ships a pre-call context-window check and a context_window_fallback_dict that swaps to a larger-context deployment when the primary model is about to overflow. [11] Portkey, Helicone, Cloudflare AI Gateway, and Vercel AI Gateway expose the same primitive under different names. [12][13][14][15] The rule: project the request tokens, compare to the model window with a safety margin, route to (a) compaction, (b) a larger-window fallback, (c) a typed ContextWindowExceededError. This is the production site for token budget enforcement. Every step logs to the same gen_ai.conversation.id.

§3.7. Make overflow responses typed and explicit

If compaction cannot bring the request under the window, refuse the turn with a typed error rather than letting the provider silently truncate. Anthropic's stop_reason and OpenAI's finish_reason fields are the vendor analogues. [3][4] The error must carry the gen_ai.conversation.id, the turn that overflowed, and the gen_ai.usage.* counters at the point of refusal. Same typed-error convention the Prompt Injection Defense post uses for refuse-and-log.

Decision tree for context-window overflow: projected total tokens exceed the window, then try compaction, then try context_window_fallback_dict, then refuse with a typed ContextWindowExceededError. Each branch logs a gen_ai.usage.* attribute.

Figure 2: The truncation decision tree. Projected total > window → try compaction → try context_window_fallback_dict → refuse with ContextWindowExceededError. Each branch emits a gen_ai.usage.* span attribute under the same gen_ai.conversation.id.

§4. Consequences

Four matter.

  • Predictable cost. Tracking gen_ai.usage.cache_read.input_tokens vs. gen_ai.usage.input_tokens, plus a counter for compaction calls, makes the bill explainable.
  • Auditable context window. Every truncation decision has a session-id, a span, and an OTel attribute. The on-call can replay the exact messages=[...] array the model saw on turn 17 in a single trace lookup.
  • Vendor portability. OTel GenAI conventions abstract the same fields across Anthropic, OpenAI, Google, and AWS Bedrock. The Azure OpenAI Assistants API retirement (August 26, 2026) is the strongest evidence that vendor-specific session management is a depreciating asset; OTel spans are the appreciating one. [6][7][16]
  • Trade-off: compaction calls are not free. They are a billable inference. Compaction is opt-in per session, rate-limited, and thresholded — not invoked every turn. The threshold belongs in session state design, not a magic default.

Cost-attribution heatmap of a multi-turn session: rows are turns, columns are system, tools, messages, tool_results, thinking. Cache-hit columns stay cold; thinking columns grow non-linearly. Demonstrates where the budget burns.

Figure 3: Cost-attribution heatmap across a multi-turn session. Rows are turns. Columns are the five buckets from Figure 1. Cache-hit columns stay cold; thinking columns grow non-linearly. The context window budget burns in the columns we forget to instrument.

§5. Anti-patterns

Three failure modes are common in 2026.

  1. Naïve last-N-message truncation. "Just keep the last 20 messages and drop the rest." This invalidates the prompt cache, drops tool results the model still needs, and silently fails the lost in the middle problem. Truncate by bucket (system, tools, message history, tool results, thinking blocks). Never touch the cached prefix.
  2. Multi-agent session sharing. One gen_ai.conversation.id across cooperating agents — a "planner" and an "executor" sharing a thread — blows up cache-invalidation logic, makes the session-id useless for incident response, and crosses the Truth Boundary. The session is single-tenant. If cross-agent context is needed, hand off a summary with a new session-id and explicit provenance, never a shared thread.
  3. Client-side auto-summarization that loses critical state. Summarising the last 50 messages with the same model that generated them is silent: it hallucinates, drops numerical state, and emits no audit trail. Use the vendor's compaction endpoint, record the compacted summary as a typed event with gen_ai.usage.* counters, and keep the pre-compaction history cold under the same gen_ai.conversation.id. Anthropic's September 29, 2025 post names three levers for context engineering — compaction, structured note-taking, sub-agent architectures — and the first two are inside the design boundary for this pattern. The third is not in our current practice; cite it for the line it draws, do not adopt it. [1]

§6. Buyer's 7-item pre-deploy checklist

  1. Define the request envelope as five named buckets, not a flat messages=[...] array. Each bucket gets its own truncation rule.
  2. Cache the static prefix with the vendor's prompt-cache feature. Verify gen_ai.usage.cache_read.input_tokens is non-zero in production.
  3. Apply server-side compaction as the default overflow handler. Keep a typed-event log under the same gen_ai.conversation.id.
  4. Add a fine-grained context-editing step (clear_tool_uses_*, clear_thinking_*) for long-running sessions.
  5. Propagate gen_ai.conversation.id on every span, header, log line.
  6. Enforce the context window budget at the gateway. Pre-call check → compaction → context_window_fallback_dict → typed ContextWindowExceededError.
  7. Make overflow responses typed and explicit. Carry the gen_ai.usage.* counters in the error so the caller can decide.

The boundary this pattern draws is intentional. It is deterministic — every step is a rule, every rule has a span. It is observable — the gen_ai.conversation.id ties every truncation decision to a trace. It is bounded — the session is single-tenant, the prefix is cached, the overflow is typed. What the conversation memory design does not do is remember across the window, reflect, or coordinate across agents. Floor, not ceiling.


If you're dealing with this in production, book a 20-minute architecture review →


Sources



  1. Anthropic, Effective context engineering for AI agents, 2025-09-29. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents 

  2. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172 v3, 2023-11-20. https://arxiv.org/abs/2307.03172 

  3. Anthropic, Context windows (Claude API docs, server-side compaction, 200K and 1M-token windows). https://docs.anthropic.com/en/docs/build-with-claude/context-windows 

  4. OpenAI, Compaction (in Conversation state guide; /responses/compact endpoint). https://platform.openai.com/docs/guides/conversation-state#managing-the-context-window 

  5. OpenAI, Prompt caching guide. https://platform.openai.com/docs/guides/prompt-caching 

  6. OpenTelemetry, Generative AI spans semantic conventions. https://github.com/open-telemetry/semantic-conventions-genai/blob/main/docs/gen-ai/gen-ai-spans.md 

  7. OpenTelemetry, Generative AI attributes registry. https://opentelemetry.io/docs/specs/semconv/attributes-registry/gen-ai/ 

  8. Google, Context caching (Gemini API docs, implicit + explicit caching). https://ai.google.dev/gemini-api/docs/caching 

  9. Anthropic, Context editing (Claude API docs, clear_tool_uses_20250919, clear_thinking_20251015). https://docs.anthropic.com/en/docs/build-with-claude/context-editing 

  10. Helicone, Sessions (header contract for Helicone-Session-Id and Helicone-Session-Path). https://docs.helicone.ai/features/sessions 

  11. LiteLLM, RoutingPre-Call Checks (Context Window) and context_window_fallback_dict. https://docs.litellm.ai/docs/routing 

  12. Portkey, AI Gateway (conditional routing, fallbacks, automatic retries, circuit breaker). https://docs.portkey.ai/docs/product/ai-gateway 

  13. Helicone, Sessions and Gateway features. https://docs.helicone.ai/features/sessions 

  14. Cloudflare, AI Gateway. https://developers.cloudflare.com/ai-gateway/ 

  15. Vercel, AI Gateway. https://vercel.com/docs/ai-gateway 

  16. Microsoft, Azure OpenAI Assistants (retirement notice, 2026-08-26). https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/assistants