
Chapter 02: Providers & The Chokepoint#

In this chapter:

  • Why every LLM provider is a different wire format, four things that differ, and one abstraction that hides them all
  • The chokepoint pattern: one function that all LLM calls flow through, provider-detected from the model name
  • Prompt caching on Anthropic: the dynamic boundary, minimum thresholds, cache TTL, and a concrete cost savings table
  • How to read cache_write_tokens and cache_read_tokens to verify caching activated, and why Anthropic silently skips caching below the threshold

Note: You only need ANTHROPIC_API_KEY to follow along. OpenAI and LiteLLM backends are tested in this chapter but not required.


1. Motivation#

In Chapter 01 you made a raw HTTP call to Anthropic's Messages API. Now try to call GPT-4o.

You'd change the URL from https://api.anthropic.com/v1/messages to https://api.openai.com/v1/chat/completions. Change the auth header from x-api-key: sk-ant-... to Authorization: Bearer sk-.... Restructure the request body: OpenAI doesn't have a separate system field; it's a message with role "system". Change the response parser: Anthropic returns content[0].text, OpenAI returns choices[0].message.content.

Four changes for one provider swap. Add Gemini. Add Groq. Add Ollama. Every new provider is four more changes in every place you make LLM calls. The solution is a chokepoint.

The chokepoint pattern: one function that all LLM calls flow through. The function detects which backend to use from the model name, dispatches to the right backend, and returns a unified result type. Callers pass a model string. They get back a CallResult. They never touch endpoints, headers, or response parsers.

There's a second reason this chapter matters: prompt caching. Cache reads cost roughly 10% of normal input price. A 2000-token system prompt at 1000 calls/day costs $1.60/day uncached. With caching, after the first write, 999 reads cost ~$0.16/day, a 90% reduction from a single architectural decision.

Caching is not automatic. You must mark which parts of your prompt to cache and which to leave dynamic. This distinction, the dynamic boundary, is one of the most important concepts in this course. We build it here and return to it in Chapter 06 when many workers share a cached context block.

Historical context: why provider abstraction matters#

OpenAI's original Completions API (2020) had no caching, no session state, no multi-turn structure. Anthropic's Messages API added the system field, typed content blocks, and cache_control. The provider landscape exploded in 2023-2024: Anthropic, OpenAI, Gemini, Mistral, Groq, Ollama, each with different wire-level choices.

Hard-coding against a single provider is an architectural liability. Write 10,000 lines against the OpenAI API and switching to Claude is a migration project, not a configuration change. The chokepoint pattern, invented independently by many teams around 2023 and formalized in libraries like LiteLLM, turns provider changes into one-line configuration updates.

As context windows grew from 4k tokens (GPT-3, 2020) to 200k+ tokens (Claude 3, 2024), re-sending the same large context on every call became expensive. A 100k-token system prompt costs $0.08 per call on Haiku, $80 per 1,000 calls just for system context. Anthropic introduced prompt caching in July 2024 (prompt-caching-2024-07-31 beta header): mark content blocks with cache_control: {type: "ephemeral"} and Anthropic stores the KV-cache (key-value matrices from the transformer's attention layers) for that content. Subsequent calls with the same prefix skip recomputing those matrices, paying only cache read price.

For a static prefix like a system prompt, the KV matrices are always identical across calls. Computing them on every call is pure waste. Caching externalizes this: compute once, store, reuse.


2. First Principles#

Multiple LLM APIs exist with different wire-level details:

| Provider  | Endpoint                       | Auth         | Cache support                   |
|-----------|--------------------------------|--------------|---------------------------------|
| Anthropic | POST /v1/messages              | x-api-key    | Yes, cache_control              |
| OpenAI    | POST /v1/chat/completions      | Bearer token | Automatic (no explicit control) |
| LiteLLM   | litellm.acompletion(model=...) | Env vars     | No                              |

What's the same: you send a system instruction and a user message, get back text, and care about token counts and cost. That's your abstraction boundary. The chokepoint function takes these common inputs, hides all per-provider plumbing, and produces a common output.
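Callers see only two types on the output side. The real definitions live in code/client.py; this is a minimal sketch of their likely shape, inferred from how they are used throughout this chapter, so treat the field defaults as assumptions:

```python
from dataclasses import dataclass

@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0    # Anthropic-only; stays 0 on other providers
    cache_write_tokens: int = 0

@dataclass
class CallResult:
    text: str           # the model's answer
    usage: Usage
    model: str          # e.g. "claude-haiku-4-5-20251001"
    provider: str       # "anthropic" | "openai" | "litellm"
    latency_ms: int = 0
    cost_usd: float = 0.0
```

Every backend, whatever its wire format, must produce exactly this shape; that is what makes the callers provider-agnostic.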

graph TD
    subgraph "Callers (provider-agnostic)"
        M3["Chapter 03\nAgent Loop + Tools"]
        M5["Chapter 05\nEval + Routing"]
        M6["Chapter 06\nFork-Join"]
    end
    subgraph "Chokepoint"
        CL["call_llm(prompt, model, system)"]
        DP["detect_provider(model)"]
        CL --> DP
    end
    subgraph "Provider backends"
        CA["_call_anthropic()\nPOST /v1/messages\nx-api-key\ncache_control blocks"]
        CO["_call_openai()\nPOST /v1/chat/completions\nAuthorization: Bearer"]
        CLL["_call_litellm()\nlitellm.acompletion()\n100+ providers"]
    end
    M3 & M5 & M6 --> CL
    DP -->|"claude-*"| CA
    DP -->|"gpt-*, o1, o3, o4"| CO
    DP -->|"everything else"| CLL

Detect the provider from the model name. Model strings are already namespaced: "claude-..." is Anthropic, "gpt-..." is OpenAI, "gemini/..." is Gemini via LiteLLM, "groq/..." is Groq via LiteLLM. You can route correctly without additional configuration.

def detect_provider(model: str) -> str:
    m = model.lower()
    if m.startswith("claude"):
        return "anthropic"
    if m.startswith(("gpt-", "o1", "o3", "o4")):
        return "openai"
    return "litellm"  # catch-all

Three cases, ordered by specificity. Anything that isn't Claude or GPT/o-series falls through to LiteLLM, which routes to Gemini, Groq, Ollama, Mistral, Bedrock, and more via the provider/model-name convention.
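A quick sanity check of the routing (detect_provider is repeated from above so the snippet runs standalone; the non-Claude, non-GPT model names are illustrative):

```python
def detect_provider(model: str) -> str:
    m = model.lower()
    if m.startswith("claude"):
        return "anthropic"
    if m.startswith(("gpt-", "o1", "o3", "o4")):
        return "openai"
    return "litellm"  # catch-all

# Namespaced model strings route correctly with zero configuration.
assert detect_provider("claude-haiku-4-5-20251001") == "anthropic"
assert detect_provider("gpt-4o") == "openai"
assert detect_provider("o3-mini") == "openai"
assert detect_provider("gemini/gemini-2.5-flash") == "litellm"
assert detect_provider("groq/llama-3.3-70b") == "litellm"
```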

Aside: Extended Thinking, When to Pay For It

The o1, o3, and o4 entries above are OpenAI's reasoning models. Anthropic's equivalent is Claude's extended thinking mode (opted into with thinking: {type: "enabled", budget_tokens: N} on the request). In both families the model spends extra output tokens reasoning internally before answering, and you are billed for those tokens at output rates. Latency rises from a few hundred milliseconds to tens of seconds on hard prompts.

Worth the price when the task has a verifiable right answer that shallow reasoning gets wrong: hard proofs, multi-step planning under constraints, constraint-heavy debugging, chain-of-evidence synthesis across many documents. These are tasks where a non-reasoning model confidently produces a wrong answer that looks plausible. Snell et al. (2024, arXiv:2408.03314) show this regime cleanly: on hard problems, more inference-time compute outperforms switching to a larger base model.

Wasted on shallow tasks: classification, structured extraction, simple Q&A against a retrieved passage, format conversion, routine rewriting. The heuristic: if you can articulate "correct" in one sentence and the model is usually right without thinking, you are paying for latency and nothing else. A practical policy: route to a reasoning model only when the router (Chapter 07) tags a task as LARGE and reasoning is expected to substantially change the answer.

We'll formalize the Augmented LLM (tools, memory, and retrieval combined with a base model) in Chapter 03. For now, the key insight is that prompt caching is the first form of memory: when you mark the system prompt with cache_control, the agent "remembers" its configuration cheaply across thousands of calls.


3. Build It#

Open code/client.py.

Mock mode#

_MOCK = os.environ.get("SWARM_MOCK", "").lower() in ("1", "true", "yes")

Checked at the top of call_llm(). When set, the function returns a hardcoded response instantly, no network, no API keys, no cost. Every code file in this course supports this pattern so you can run and test without burning credits.

The pricing table#

MODEL_PRICING: dict[str, dict[str, float]] = {
    "claude-haiku-4-5-20251001": {
        "input": 0.80, "output": 4.00,
        "cache_read": 0.08, "cache_write": 1.00,
    },
    "gpt-4o": {
        "input": 2.50, "output": 10.00,
        "cache_read": 0.00, "cache_write": 0.00,
    },
    "gemini/gemini-2.5-flash": {
        "input": 0.075, "output": 0.30,
        "cache_read": 0.00, "cache_write": 0.00,
    },
    ...
}

Extended from Chapter 01 to cover all three provider families. OpenAI and LiteLLM models have cache_read: 0.00 because they either don't support explicit caching or handle it opaquely. Only Anthropic exposes cache token counts you can measure.
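The table feeds compute_cost, which the chokepoint calls on every result. The course's actual implementation is in code/client.py; this sketch assumes Anthropic-style usage reporting, where input_tokens counts only the uncached portion of the prompt:

```python
from types import SimpleNamespace

MODEL_PRICING = {  # dollars per million tokens (Haiku entry from the table above)
    "claude-haiku-4-5-20251001": {
        "input": 0.80, "output": 4.00,
        "cache_read": 0.08, "cache_write": 1.00,
    },
}

def compute_cost(model: str, usage) -> float:
    """Cost in USD. Cached tokens bill at cache rates, not the input rate."""
    p = MODEL_PRICING.get(model)
    if p is None:
        return 0.0  # unknown model: no pricing data
    return (
        usage.input_tokens * p["input"]
        + usage.output_tokens * p["output"]
        + usage.cache_read_tokens * p.get("cache_read", 0.0)
        + usage.cache_write_tokens * p.get("cache_write", 0.0)
    ) / 1_000_000

# A 2000-token cache read plus 100 output tokens on Haiku:
u = SimpleNamespace(input_tokens=0, output_tokens=100,
                    cache_read_tokens=2000, cache_write_tokens=0)
# (2000 * 0.08 + 100 * 4.00) / 1e6 = 0.00016 + 0.0004 = 0.00056
```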

The dynamic boundary in _call_anthropic#

The system prompt is the same across all calls: it defines the agent's identity, tools, and constraints. Caching it means writing once and reading cheaply. The user prompt changes every call; caching it means paying write cost on every call with zero hits.

Key block:

system_blocks = [{
    "type": "text",
    "text": system,
    "cache_control": {"type": "ephemeral"},  # cached
}]

messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": prompt,
        # No cache_control: this is the dynamic boundary.
    }],
}]

[full: swarm/core/client.py:60-120]

The dynamic boundary in diagram form:

CACHED (written once, read many)
  - system prompt (role definition)
  - static_doc (large context)
---------------------------------
NOT CACHED (changes every call)
  - user prompt
  - tool results

In Chapter 06 (fork-join orchestration), multiple workers share a large cached document, cached once, read at 10% of input cost by every worker.

Two technical requirements for Anthropic caching to activate:

  1. The content block must be at least 1024 tokens (Haiku) or 2048 tokens (Sonnet/Opus). Below this threshold, Anthropic doesn't cache even if you set cache_control.
  2. You must send the anthropic-beta: prompt-caching-2024-07-31 header.

Warning: Silent cache skip. If your content block is below the minimum threshold, Anthropic silently ignores cache_control, no error, no indication. If cache_creation_input_tokens is 0: (a) check the beta header is set, (b) check content exceeds the threshold, (c) verify content is identical across calls.

The fix is to lengthen the system prompt (more detailed instructions, examples, or context) until it crosses the threshold. A 1024-token system prompt is roughly 750 words of English prose, or about 100 lines of code with comments.1
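A rough pre-flight check catches the silent skip before you burn a call. The ~4 characters per token ratio is a rule of thumb for English text, not a tokenizer, so treat the estimate conservatively; both function names here are hypothetical:

```python
def approx_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English prose."""
    return len(text) // 4

def looks_cacheable(system_prompt: str, min_tokens: int = 1024) -> bool:
    """Return True if the block looks large enough to cache (Haiku threshold).

    Anthropic silently ignores cache_control below the minimum, so warn
    before relying on cache reads showing up in production.
    """
    est = approx_tokens(system_prompt)
    if est < min_tokens:
        print(f"warning: ~{est} tokens < {min_tokens}; cache_control will be ignored")
        return False
    return True
```

For exact counts, Anthropic exposes a token-counting endpoint; the heuristic is just a cheap local guard.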

_call_openai and _call_litellm#

The OpenAI backend changes the URL, auth header, response path, and field names:

headers = {"Authorization": f"Bearer {key}", "content-type": "application/json"}
url = "https://api.openai.com/v1/chat/completions"
text = data["choices"][0]["message"]["content"]
usage = Usage(
    input_tokens=raw_usage.get("prompt_tokens", 0),
    output_tokens=raw_usage.get("completion_tokens", 0),
)

LiteLLM is imported lazily so users who only use Anthropic and OpenAI don't need to install it. It provides a single OpenAI-compatible interface over ~100 providers via litellm.acompletion(model, messages, max_tokens).

call_llm: the chokepoint#

async def call_llm(
    prompt: str, *,
    model: str = "claude-haiku-4-5-20251001",
    system: str = "You are a helpful assistant.",
    max_tokens: int = 1024,
) -> CallResult:
    provider = detect_provider(model)
    if _MOCK:
        return CallResult(...)

    t0 = time.monotonic()
    if provider == "anthropic":
        text, usage = await _call_anthropic(prompt, system, model, max_tokens)
    elif provider == "openai":
        text, usage = await _call_openai(prompt, system, model, max_tokens)
    else:
        text, usage = await _call_litellm(prompt, system, model, max_tokens)

    latency_ms = int((time.monotonic() - t0) * 1000)
    cost = compute_cost(model, usage)
    return CallResult(text=text, usage=usage, model=model, provider=provider,
                      latency_ms=latency_ms, cost_usd=cost)

Twenty-five lines. No caller touches _call_anthropic or _call_openai directly. Every future chapter calls call_llm(). The provider is invisible.


4. Run It#

Mock mode, no API keys required:

SWARM_MOCK=true python code/client.py

Real mode with Anthropic:

Call 1: Claude Haiku (Anthropic backend)
Provider:      anthropic
Text:          Paris
Input tokens:  48
Cache read:    0
Cache write:   48
Latency:       521ms
Cost:          $0.00003840

On the first call, cache_write_tokens is non-zero. Call again immediately:

Cache read:    48
Cache write:   0
Cost:          $0.00000384

The second call reads from cache. Cost dropped 10x.

sequenceDiagram
    participant C as client.py
    participant API as Anthropic API
    participant Cache as Prompt Cache

    C->>API: Call 1: system block with cache_control
    API->>Cache: Write system prompt (48 tokens)
    API-->>C: usage.cache_write_tokens=48, cost=$0.000048

    C->>API: Call 2: same system block
    API->>Cache: Cache hit! Read system prompt
    API-->>C: usage.cache_read_tokens=48, cost=$0.000005

    Note over C,Cache: Cost on call 2 is ~10% of call 1
    Note over C,Cache: Cache TTL ≈ 5 minutes

5. Observe It#

How to read the cost breakdown#

Every CallResult has usage.cache_read_tokens and usage.cache_write_tokens:

  • cache_write_tokens > 0: Anthropic wrote to cache. You paid cache_write price (slightly above normal input).
  • cache_read_tokens > 0: Anthropic read from cache. You paid cache_read price (~10% of input).

Cache TTL is approximately 5 minutes. After expiry, the next call pays cache write price again.

Minimum size for caching#

Anthropic only caches blocks ≥1024 tokens (Haiku) or ≥2048 tokens (Sonnet/Opus). A 6-token system prompt like "You are a helpful assistant." will never be cached: cache_write_tokens stays 0. Verify caching activated by checking cache_write_tokens > 0.

Cost comparison: 100 calls with a 2000-token system prompt on Haiku#

| Scenario                 | Per-call input cost       | 100-call total |
|--------------------------|---------------------------|----------------|
| No caching               | 2000 × $0.80/M = $0.0016  | $0.16          |
| First call (cache write) | 2000 × $1.00/M = $0.002   | $0.002         |
| Calls 2-100 (cache read) | 2000 × $0.08/M = $0.00016 | $0.01584       |
| With caching (total)     |                           | $0.01784       |
| Savings                  |                           | 89%            |

The cache write on call 1 costs slightly more than normal input ($1.00/M vs $0.80/M). From call 2, every cache read is $0.08/M, 10x cheaper. Over 10,000 calls: $16 uncached vs ~$1.60 cached.
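The table's numbers fall straight out of the rate card; a quick check of the arithmetic:

```python
RATE_INPUT, RATE_WRITE, RATE_READ = 0.80, 1.00, 0.08  # $/M tokens, Haiku
TOKENS, CALLS = 2000, 100

uncached = CALLS * TOKENS * RATE_INPUT / 1e6           # all 100 calls at input rate
cached = (TOKENS * RATE_WRITE                          # call 1: cache write
          + (CALLS - 1) * TOKENS * RATE_READ) / 1e6    # calls 2-100: cache reads
savings = 1 - cached / uncached

print(f"uncached ${uncached:.4f}  cached ${cached:.5f}  savings {savings:.0%}")
# uncached $0.1600  cached $0.01784  savings 89%
```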

For a production agentic system with 1M input tokens/day and a 90% cache hit rate on a 2,000-token system prompt:

| Workload           | Daily tokens | Rate    | Daily cost |
|--------------------|--------------|---------|------------|
| No caching         | 1,000,000    | $0.80/M | $0.80      |
| Cache writes (10%) | 100,000      | $1.00/M | $0.10      |
| Cache reads (90%)  | 900,000      | $0.08/M | $0.07      |
| With caching       |              |         | $0.17      |
| Annual savings     |              |         | $229/year  |

At this scale, caching saves 79% of system prompt costs. With a 10,000-token system prompt (a detailed agent with tool definitions) and proportionally higher daily token volume, savings scale to $1,000+/year.

The provider field#

CallResult now includes provider. In a system with mixed model calls, knowing which calls hit Anthropic vs. OpenAI vs. LiteLLM is essential for debugging and cost attribution.


6. Break It#

call_llm is one-shot. Call it on a task where each answer depends on the previous one and the limitation is immediately visible: each call is an independent HTTP request, so the model in call 3 has never seen calls 1 or 2. It will fail to recall prior context, or worse, hallucinate it. (See modules/02_providers/what_goes_wrong.py for the runnable demo.)

Chapter 03 fixes it: the agent loop accumulates a messages list, so every prior turn is included in each subsequent call. call_llm doesn't handle this; the loop does.

Second break: call_llm has no retry logic. During high API load, a 429 Too Many Requests crashes. Chapter 01 Exercise 02 built call_with_retry; Exercise 03 of this chapter builds call_with_fallback. Both are essential for production resilience.

Third break: the cost blindspot. call_llm returns cost_usd on every call, but no caller aggregates it. In production you need cost rolled up by session, user, model, and time period. Chapter 05 addresses this. You can get ahead of it now by wrapping call_llm in a decorator that logs every CallResult to a JSONL file:

import json
import os
from datetime import datetime

async def tracked_call_llm(prompt: str, **kwargs) -> CallResult:
    result = await call_llm(prompt, **kwargs)
    os.makedirs("./logs", exist_ok=True)
    with open("./logs/calls.jsonl", "a") as f:
        f.write(json.dumps({
            "ts": datetime.now().isoformat(),
            "model": result.model,
            "provider": result.provider,
            "cost_usd": result.cost_usd,
            "input_tokens": result.usage.input_tokens,
            "output_tokens": result.usage.output_tokens,
        }) + "\n")
    return result

Start with it in Chapter 02 and you'll have months of cost data when you need to optimize in Chapter 05.

Anti-pattern: Multi-Provider Sprawl

Without a chokepoint, teams add provider integrations ad hoc. One engineer adds Anthropic. Another adds OpenAI. A third adds Gemini. Each uses different field names (usage.input_tokens vs usage.prompt_tokens). Logging is inconsistent. Cost tracking is impossible.

The fix is architectural: mandate that all LLM calls go through one function. It doesn't need to be call_llm, but there must be one. A codebase with three different LLM clients is three times harder to debug, cost-track, and migrate.


7. Cost Attribution and Observability#

Summing cost_usd gives total spend but not the breakdown by provider, model tier, or call type. Chapter 05 is entirely about cost optimization; you need this granularity. Minimal attribution pattern:

from dataclasses import dataclass, field

@dataclass
class CostAttribution:
    total_usd: float = 0.0
    by_provider: dict[str, float] = field(default_factory=dict)
    by_model: dict[str, float] = field(default_factory=dict)
    cache_write_usd: float = 0.0
    cache_read_usd: float = 0.0
    uncached_usd: float = 0.0
    call_count: int = 0

    def record(self, result: CallResult) -> None:
        self.total_usd += result.cost_usd
        self.by_provider[result.provider] = (
            self.by_provider.get(result.provider, 0.0) + result.cost_usd)
        self.by_model[result.model] = (
            self.by_model.get(result.model, 0.0) + result.cost_usd)
        if result.usage.cache_write_tokens > 0:
            self.cache_write_usd += result.cost_usd
        elif result.usage.cache_read_tokens > 0:
            self.cache_read_usd += result.cost_usd
        else:
            self.uncached_usd += result.cost_usd
        self.call_count += 1

Run 100 calls through this and you see: what fraction is cache writes (setup cost), cache reads (ongoing at 10%), uncached (saving opportunity). High uncached_usd on system-prompt calls means you're below the cache threshold. High cache_write_usd relative to cache_read_usd means the TTL is expiring; call more frequently or accept the overhead.
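A usage sketch, with SimpleNamespace stubs standing in for real CallResults (the stubs mirror only the fields record() reads; in production the results come from call_llm):

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

@dataclass
class CostAttribution:  # as defined above
    total_usd: float = 0.0
    by_provider: dict = field(default_factory=dict)
    by_model: dict = field(default_factory=dict)
    cache_write_usd: float = 0.0
    cache_read_usd: float = 0.0
    uncached_usd: float = 0.0
    call_count: int = 0

    def record(self, result) -> None:
        self.total_usd += result.cost_usd
        self.by_provider[result.provider] = self.by_provider.get(result.provider, 0.0) + result.cost_usd
        self.by_model[result.model] = self.by_model.get(result.model, 0.0) + result.cost_usd
        if result.usage.cache_write_tokens > 0:
            self.cache_write_usd += result.cost_usd
        elif result.usage.cache_read_tokens > 0:
            self.cache_read_usd += result.cost_usd
        else:
            self.uncached_usd += result.cost_usd
        self.call_count += 1

def fake(cost, w=0, r=0):
    """Stub CallResult for illustration only."""
    return SimpleNamespace(cost_usd=cost, provider="anthropic",
                           model="claude-haiku-4-5-20251001",
                           usage=SimpleNamespace(cache_write_tokens=w,
                                                 cache_read_tokens=r))

attr = CostAttribution()
attr.record(fake(0.002, w=2000))        # call 1: cache write
for _ in range(9):
    attr.record(fake(0.0002, r=2000))   # calls 2-10: cache reads at ~10%
```

After ten calls, cache_read_usd is already close to cache_write_usd; in a healthy high-frequency workload the read bucket dwarfs the write bucket.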

8. The Cache Hit/Miss Decision Flow#

The decision is more complex than "set cache_control → caching happens."

flowchart TD
    A["Request with cache_control blocks"] --> B{"Beta header set?"}
    B -->|No| C["Normal input pricing"]
    B -->|Yes| D{"Block size ≥ minimum?"}
    D -->|No| E["Silent skip"]
    D -->|Yes| F{"Cache hit within TTL?"}
    F -->|Yes| G["Cache read: $0.08/M"]
    F -->|No| H["Cache write: $1.00/M"]
    E & C --> I["Normal input: $0.80/M"]

Four outcomes:

  1. No beta header: normal input pricing, no caching.
  2. Below threshold: silent skip, normal input pricing. Monitor cache_write_tokens to detect.
  3. Cache miss (first call or after TTL): cache write at $1.00/M (25% above normal input). You're paying to populate the cache.
  4. Cache hit: cache read at $0.08/M (10% of normal input). Target state for high-frequency calls.

For low-frequency operations (say, a batch job running once an hour), you'll pay cache write on every run with zero reads. Caching only helps if the cache stays warm: roughly one call every 4-5 minutes at minimum.


9. Advanced: Structuring Multiple Cache Breakpoints#

You can set cache_control on multiple content blocks. Anthropic caches up to the last block with cache_control. Consider an agent reading a large codebase: small static system prompt, large codebase cached as a document block, dynamic user question uncached.

system_blocks = [{
    "type": "text",
    "text": agent_role,               # ~200 tokens, cached
    "cache_control": {"type": "ephemeral"},
}]

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": codebase_content,  # ~50k tokens, cached
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": user_question},    # ~50 tokens, NOT cached
    ],
}]

If the codebase block hits the cache, you pay $0.08/M instead of $0.80/M for its 50,000 tokens: $0.004 instead of $0.04 per call, or $36/day saved at 1,000 calls/day.

This is the Chapter 06 fork-join pattern: multiple workers sharing a single cached document, each asking different questions.

War Story: The $2,000 Caching Bug

A team had a system prompt with 3,000 tokens of detailed instructions. They set cache_control on it and watched cost drop 90% in testing. In production, costs were higher than expected.

The bug: the system prompt included a timestamp ("Current date: {today}") injected fresh on every call. Since the prefix changed on every call, there were no cache hits. The "cached" block was written and immediately expired without ever being read.

The fix was to move the timestamp out of the system prompt into the first user message (appropriately uncached). A single dynamic field in an otherwise-static block kills the entire cache hit rate.
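A sketch of the fix, with hypothetical names: keep the cached block byte-identical across calls, and inject anything dynamic on the uncached side of the boundary.

```python
from datetime import date

def build_request(system_static: str, user_question: str) -> dict:
    """Simplified Anthropic Messages request body (sketch, not the course code)."""
    return {
        "system": [{
            "type": "text",
            "text": system_static,  # byte-identical every call -> cache hits
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{
            "role": "user",
            "content": [{
                "type": "text",
                # The timestamp now lives in the uncached user turn,
                # where changing it every call costs nothing.
                "text": f"Current date: {date.today().isoformat()}\n\n{user_question}",
            }],
        }],
    }
```

The general rule: before marking a block for caching, grep it for anything interpolated at call time.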


10. Exercises#

Exercise 01: Batch Call (exercises/01_batch_call.py)#

Implement batch_call(prompt, models) -> list[CallResult]. Call the same prompt on all models concurrently using asyncio.gather. On failure, include CallResult(text="ERROR: ...", cost_usd=0.0). Don't let one failure kill the batch. In Chapter 05 you'll build a model router using this comparison data.

Exercise 02: Cache Measurement (exercises/02_cache_measurement.py)#

Implement measure_cache_hits(system_prompt, n_calls). Call the same large system prompt (>1024 tokens) sequentially. Expected: call 1 has cache_write_tokens > 0; calls 2+ have cache_read_tokens > 0 at ~10% the cost. Wait 6 minutes, run again, and watch the cache expire.

Key verification: if cache_write_tokens == 0 on call 1 even with cache_control, your prompt is below 1024 tokens. Pad it until caching activates.

Exercise 03: Provider Fallback (exercises/03_provider_fallback.py)#

Implement call_with_fallback(prompt, models). Try each model in order. On exception, log and continue. Return the first successful CallResult; if all fail, re-raise the last exception. Foundation of budget routing: try cheapest first, escalate only on failure.


11. Summary#

Key takeaways:

  • The chokepoint pattern routes all LLM calls through one function. Callers pass a model name; the function detects provider and dispatches.
  • Provider detection from the model name string is sufficient: "claude-*" → Anthropic, "gpt-*" → OpenAI, everything else → LiteLLM.
  • Anthropic prompt caching requires the anthropic-beta: prompt-caching-2024-07-31 header and a content block of ≥1,024 tokens (Haiku) or ≥2,048 tokens (Sonnet/Opus).
  • If caching is silent-skipped, cache_write_tokens stays 0 and you pay normal input. Always verify cache_write_tokens > 0 after adding cache_control.
  • The dynamic boundary: cache what's static (system prompt, shared documents); don't cache what changes per call (user messages, tool results).
  • Cache reads cost ~10% of normal input ($0.08/M vs $0.80/M on Haiku). At 1,000 calls/day with a 2,000-token system prompt, caching saves ~89% of system prompt costs.
  • Prompt caching is the first form of agent memory, cheap persistence across thousands of calls. The full Augmented LLM (tools + memory + retrieval) is formalized in Chapter 03.
  • The provider field on CallResult enables multi-provider cost attribution, logging, and debugging.

  1. As of 2026: minimum cacheable block is 1,024 tokens for claude-haiku-4-5 and 2,048 tokens for claude-sonnet-4-6 and claude-opus-4-6. Documented at https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching.