Chapter 01: The Raw Call#
In this chapter:

- What actually happens when you call `client.messages.create()`, down to the HTTP wire
- How tokens are counted, why they're the unit of billing, and how to compute cost from the `usage` fields
- Anthropic's three design principles for agents (Simplicity, Transparency, ACI) and why this book is structured around Simplicity
- What stays constant as LLMs evolve and what doesn't: the invariant agent loop vs. the volatile model names
- Why the book uses three layers to refer to models, and what each layer tells you

Quick setup check, run these before continuing:

pip install anthropic
export ANTHROPIC_API_KEY="..."
1. Motivation#
You just ran pip install anthropic and typed:
import anthropic
client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)
print(message.content[0].text)
It worked. But what actually happened? What's in usage? Why does price vary between calls? What is cache_creation_input_tokens? If your production system breaks at 3am with anthropic.APIStatusError: 529 Overloaded, how do you debug it?
The Anthropic SDK is a thin wrapper over one HTTP POST request. The SDK constructs a JSON body, sets three HTTP headers, sends a request, parses the JSON response, and hands you a Python object. That's it.
Every SDK, every LiteLLM abstraction, every agent framework bottoms out at this one HTTP call.¹ Understand this chapter, and you understand the foundation. The rest is plumbing.
Historical context: from completion to conversation#
Early LLM APIs (GPT-3's Completions API, 2020) had no concept of roles: instructions and content competed inside a single text string. The shift to structured conversation turns (OpenAI's Chat Completions in 2023, Anthropic's Messages API the same year) was not cosmetic. The model is trained on the structure, and proper role alternation produces reliably better instruction-following. Anthropic's key design choice, making the system prompt a top-level field rather than a message role, means Claude treats system instructions as an authoritative layer separate from user content.
Under the Hood: Why the System Prompt is a Top-Level Field
OpenAI puts the system instruction as a message with `role: "system"`. Anthropic puts it as a separate top-level `system` field. OpenAI's design treats the system message as the first turn in the conversation; it goes through the same attention mechanism as every other message, and users can construct adversarial inputs that try to override it.

Anthropic's design makes the system prompt structurally distinct. It is a separate parameter, and Claude models are fine-tuned to give system-field instructions higher weight than user-turn instructions. Put your agent's core instructions, constraints, and tool definitions in the `system` field, not as a user message. The model will follow them more reliably.
What the SDK actually does#
The Anthropic Python SDK is ~8,000 lines, mostly retry logic, streaming support, and type definitions. The actual HTTP call is about 30 lines. Key block:
body = {"model": model, "max_tokens": max_tokens, "messages": messages}
if system is not None:
    body["system"] = system
body.update(kwargs)
headers = {
    "x-api-key": self.api_key,
    "anthropic-version": self.default_version,
    "content-type": "application/json",
}
response = self._http_client.post(
    f"{self.base_url}/v1/messages", json=body, headers=headers,
)
return Message(**response.json())
The Message you get back is a Python dataclass populated from JSON. message.content[0].text is just response.json()["content"][0]["text"].
When things go wrong, bypass the SDK and use curl or httpx directly to see the raw JSON. The abstraction helps when things work; it obscures what's happening when they don't.
2. First Principles#
The endpoint#
Every call to Claude goes to one URL:

POST https://api.anthropic.com/v1/messages
Stateless. Each request is independent. If you want conversation history, you include it in the request body.
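Since the API is stateless, "conversation memory" is just a list you maintain and resend on every call. A minimal sketch (the assistant reply is illustrative, not a real API response):

```python
history = []

def add_turn(history, role, content):
    # The API requires strict user/assistant alternation, starting with "user".
    history.append({"role": role, "content": content})
    return history

# Turn 1: you send one user message and get an assistant reply back.
add_turn(history, "user", "What is the capital of France?")
add_turn(history, "assistant", "Paris")  # illustrative reply, not a real response

# Turn 2: the follow-up only makes sense because the full history rides along.
add_turn(history, "user", "What is its population?")

# The entire list becomes the "messages" field of the next request body.
body = {
    "model": "claude-haiku-4-5-20251001",
    "max_tokens": 1024,
    "messages": history,
}
```

Forget to resend a turn and the model simply never saw it; there is no server-side session to fall back on.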
The request body#
The minimum viable request body:
{
  "model": "claude-haiku-4-5-20251001",
  "max_tokens": 1024,
  "messages": [{"role": "user", "content": "What is the capital of France?"}]
}
Four fields to know:
- `model`: which model to run. Determines speed, capability, and price. `"gpt-4o"` will 404.
- `max_tokens`: hard ceiling on output. The model will stop even mid-sentence. Caps per-call billing exposure.
- `messages`: list of `{role, content}` objects with role `"user"` or `"assistant"`. To send conversation history, include every prior turn. There is no memory at the API level, only what you send.
- `system` (optional): a string outside the conversation that tells the model how to behave. Think of it as a persistent instruction users can't override.
Three required headers:

- `x-api-key`: your API key.
- `anthropic-version`: the API contract date, e.g. `2023-06-01`.
- `content-type`: `application/json`.
The `anthropic-version` header pins the API contract. With it, you get deterministic behavior even as Anthropic ships changes.²
As a curl command:
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-haiku-4-5-20251001",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
The SDK produces exactly this, just without the formatting hassle.
The response body#
{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "Paris"}],
  "model": "claude-haiku-4-5-20251001",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 42,
    "output_tokens": 1,
    "cache_read_input_tokens": 0,
    "cache_creation_input_tokens": 0
  }
}
Fields that matter most:
- `content`: list of typed blocks. Right now you'll see `"text"`, but tool calls produce `"tool_use"` blocks (Chapter 03). Always iterate and filter by type; never assume `content[0]` is what you want.
- `usage`: the token counts. These determine your bill.
- `stop_reason`: `"end_turn"` means the model finished naturally. `"max_tokens"` means you hit the ceiling and the response is truncated. `"stop_sequence"` means the model produced a sentinel string you specified.
- `model`: the model that actually ran.
sequenceDiagram
participant Y as Your Code
participant SDK as Anthropic SDK
participant API as api.anthropic.com
Y->>SDK: client.messages.create(model, messages, max_tokens)
SDK->>SDK: Build JSON body + headers
SDK->>API: POST /v1/messages
API->>API: Tokenize, run inference
API-->>SDK: 200 OK { id, content, usage, stop_reason }
SDK-->>Y: Message(content=[TextBlock(text="Paris")], usage=Usage(...))
What tokens actually are#
Tokens are the unit of computation for language models: not characters, not words, though loosely correlated with both. A token is a chunk from the model's vocabulary: a common word, a prefix, a suffix, or a punctuation mark. The tokenizer splits your input into these chunks before the model sees it.
For English text: 1 token is roughly 4 characters or 0.75 words, but this varies significantly. Common English words tokenize to 1 token ("the", "cat", "Paris"). Less common words may split: "tokenizer" becomes "token" + "izer" (2 tokens). Numbers and symbols are often 1 token per character. Non-Latin scripts (Chinese, Arabic) may have fewer characters per token. Code often has more tokens per character than prose.
graph LR
subgraph "Input string"
A["What is the capital of France?"]
end
subgraph "Tokenizer output"
B["What"] --> C[" is"] --> D[" the"] --> E[" capital"] --> F[" of"] --> G[" France"] --> H["?"]
end
subgraph "Model sees"
I["[1842] [318] [262] [3139] [286] [4881] [30]"]
end
A --> B
H --> I
The model never sees characters. It sees integer token IDs. The tokenizer converts text to integers on input and integers back to text on output. Everything in between operates on dense numerical vectors derived from these IDs.
Exercise 01 has you measure the token/character ratio empirically across 10 prompts of varying content.
Under the Hood: Anthropic's Tokenizer vs. tiktoken
Anthropic's tokenizer is proprietary and differs from OpenAI's `tiktoken`. Same conceptual approach (byte-pair encoding), different vocabulary. For production billing-sensitive operations, do not use tiktoken to estimate Anthropic counts. Use the `/v1/messages/count_tokens` endpoint: a free API call that returns exact counts before you commit to the main call. The "1 token ≈ 4 characters" rule is useful for back-of-envelope estimates and nothing more.³
The pricing formula#
Anthropic charges per million tokens:
cost = (input_tokens × input_price_per_M
        + output_tokens × output_price_per_M
        + cache_read_tokens × cache_read_price_per_M
        + cache_write_tokens × cache_write_price_per_M) / 1_000_000
Current prices as of 2026⁴ (verify at anthropic.com/pricing before production use):
| Tier | Model | Input/M | Output/M | Cache Read/M | Cache Write/M |
|---|---|---|---|---|---|
| Small, fast | claude-haiku-4-5-20251001 | $0.80 | $4.00 | $0.08 | $1.00 |
| Mid-tier reasoning | claude-sonnet-4-6 | $3.00 | $15.00 | $0.30 | $3.75 |
| Premium reasoning | claude-opus-4-6 | $15.00 | $75.00 | $1.50 | $18.75 |
Cache reads are roughly 10x cheaper than normal input: pay once to store a large system prompt, then read it cheaply on every subsequent call. Caching is implemented in Chapter 02.
Sidebar: What Will Change vs. What Won't
The space is evolving fast. Here's what's stable:
Won't change:

- The agent loop (observe, plan, act, observe). This is 40-year-old control theory.
- The HTTP request/response pattern. Stateless JSON is the consensus interface.
- Token-based pricing. The economics may shift; tokens-as-units is fundamental to transformers.
- Handling rate limits, retries, failures. Networks are unreliable. True since ARPANET.
Will definitely change:

- Model names. `claude-haiku-4-5-20251001` is today's. By the time you read this there may be a `-20261001`.
- Specific token prices. Down ~10x every two years. Your estimates from today may be off by 10x in two years.
- Frameworks. LangChain, AutoGen, CrewAI, and the framework of the week will churn.
- Context window sizes. 200k was a breakthrough in 2024. The limits keep moving.

Depend on the loop, not the model. Model names are configuration values, not hard-coded strings.
Sidebar: The Three-Layer Version Convention
This book refers to models three ways.
In prose: "a small fast model" or "a frontier reasoning model." Prose describes capability tiers, which are stable over years.
In footnotes: the current example, "Claude Haiku 4.5, GPT-4o-mini as of 2026."⁵ Footnotes commit to a real thing you can verify today. When a newer model supersedes the footnoted one, the footnote is still historically accurate.
In code: pinned version strings like `"claude-haiku-4-5-20251001"`. Code must be deterministic. When you run the repo's code, it should produce the same results as when it was written.
Why tokens and not characters?#
Attention cost scales as O(n²) in sequence length. Characters produce sequences ~4× longer than tokens for English, making attention 16× more expensive. Words require a vocabulary of 170,000+ entries, too large for an embedding table. Tokens hit the sweet spot: 32,000-100,000 subword units cover almost all text efficiently.
The tokenizer starts with individual bytes and iteratively merges frequent pairs into larger chunks. Common words become single tokens; rare words split into pieces. For exact counts in billing-sensitive operations, use /v1/messages/count_tokens.
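A toy illustration of that merge process, assuming character-level starting units instead of bytes (real tokenizers learn their merge table from a large training corpus, not from the input being tokenized):

```python
from collections import Counter

def merge_most_frequent_pair(tokens: list[str]) -> list[str]:
    """One BPE-style merge step: fuse the most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged: list[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)  # the pair becomes one new vocabulary unit
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text = "the cat and the hat"
tokens = list(text)   # start from individual characters (real BPE starts from bytes)
for _ in range(6):    # each merge round shrinks the sequence
    tokens = merge_most_frequent_pair(tokens)
```

After a few rounds, frequent chunks like "th" survive as single units while rare characters stay split, which is exactly why common words cost one token and rare words cost several.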
3. Anthropic's Design Principles and This Course's Structure#
Anthropic has articulated three principles for designing effective agents. Their implications run through every architectural decision in this book.
Simplicity: start with the simplest architecture that could work. Add complexity only when a simpler approach demonstrably fails. A raw API call is simpler than an agent loop. An agent loop is simpler than a multi-agent swarm. The progression raw call → loop → tools → memory → orchestration → production is about adding the minimal complexity required to handle each new class of problem.
Transparency: at every level, the system should be inspectable. You should see what prompt was sent, what tokens were consumed, what cost was incurred, what decision was made. This is why we build our own CallResult dataclass with explicit fields, rather than rely on opaque SDK return values. An agent you can't inspect is an agent you can't improve.
ACI (Agent-Computer Interface): the boundary between the agent and the tools it uses matters enormously. A poorly designed tool interface causes as many failures as a poorly designed prompt. When we build tool definitions in Chapter 03, we'll spend as much time on the interface as on the implementation, because the model uses the interface description, not the code.
This entire course is structured around Simplicity. Each chapter adds one layer of complexity motivated by a concrete failure of the simpler approach.
The Simplicity Ethos (and Its Anti-Pattern)
Every abstraction layer has a cost: cognitive overhead, debugging complexity, failure modes. You earn the right to add a layer by demonstrating that the previous layer fails for your use case.
The most common mistake is the inverse: reaching for a framework before understanding the primitive. A developer copies 50 lines of LangChain boilerplate, ships, and six months later can't explain why it costs $400/day. Frameworks save real time, but you need the foundations first.
The "Break It" section in every chapter motivates the next chapter. Complexity is always earned.
4. Build It#
Open code/raw_call.py. The pricing dict is a plain dictionary keyed by model ID. If a model isn't in it, compute_cost() returns 0.0: safe, but your cost tracking will be wrong. In production, raise or log a warning.
Key dataclasses:
@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0

@dataclass
class CallResult:
    text: str
    usage: Usage
    model: str
    latency_ms: int
    cost_usd: float
Usage mirrors the API response fields. CallResult is what we return to callers, everything you want to log, display, or aggregate. latency_ms isn't in the API response; we measure it with time.monotonic() (monotonic clock, immune to system clock adjustments, preferred for measuring elapsed time).
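The timing pattern, sketched with a placeholder workload standing in for the API call:

```python
import time

def timed(fn):
    """Run fn() and return (result, elapsed whole milliseconds).

    time.monotonic() is immune to system clock adjustments (NTP, DST),
    which is why it's preferred over time.time() for elapsed-time measurement.
    """
    start = time.monotonic()
    result = fn()
    latency_ms = int((time.monotonic() - start) * 1000)
    return result, latency_ms

# Placeholder workload; in raw_call.py this wraps the HTTP request.
result, latency_ms = timed(lambda: sum(range(1000)))
```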
The cost function multiplies each usage field by its per-million price and rounds to 8 decimal places, enough precision to track fractions of a cent across millions of calls. [full: modules/01_raw_call/code/raw_call.py:40-80]
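A minimal sketch consistent with that description, using Haiku's prices from the table above (the `PRICING` dict shown here is illustrative; the repo's actual helper may differ in naming and coverage):

```python
# Hypothetical pricing dict keyed by model ID; USD per million tokens,
# values taken from this chapter's price table.
PRICING = {
    "claude-haiku-4-5-20251001": {
        "input": 0.80, "output": 4.00, "cache_read": 0.08, "cache_write": 1.00,
    },
}

def compute_cost(model: str, input_tokens: int, output_tokens: int,
                 cache_read_tokens: int = 0, cache_write_tokens: int = 0) -> float:
    p = PRICING.get(model)
    if p is None:
        return 0.0  # unknown model: safe but silently wrong, as noted above
    cost = (input_tokens * p["input"]
            + output_tokens * p["output"]
            + cache_read_tokens * p["cache_read"]
            + cache_write_tokens * p["cache_write"]) / 1_000_000
    return round(cost, 8)  # 8 decimals keeps fractions of a cent visible
```

The 42-input / 1-output call from the Run It section comes out to $0.0000376 with these prices.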
The HTTP call uses httpx.AsyncClient(timeout=60.0). The 60-second timeout is because Anthropic's API can be slow on long outputs and the default is too short for production. response.raise_for_status() converts any 4xx/5xx into an httpx.HTTPStatusError callers can catch specifically.
Response parsing handles multi-block responses:
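A sketch of that parsing logic, operating on the parsed response JSON (the repo's helper may be named differently):

```python
def extract_text(response_json: dict) -> str:
    # Keep only "text" blocks; "tool_use" and other block types are skipped.
    return "".join(
        block["text"]
        for block in response_json["content"]
        if block["type"] == "text"
    )

# Shape mirrors the response body shown earlier in this chapter.
sample = {"content": [{"type": "text", "text": "Paris"}]}
```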
This iterates all content blocks, filters for type "text", and joins them. It handles multi-block responses correctly and is forward-compatible with tool use, where non-text blocks appear in the same list.
Cache token fields are aliased internally because the API names (cache_read_input_tokens, cache_creation_input_tokens) differ from our dataclass fields (cache_read_tokens, cache_write_tokens). The or 0 guards against null returns when caching isn't active. These fields won't show non-zero values until Chapter 02, but parsing them now means cost tracking will be correct when caching is enabled.
5. Run It#
Expected output:
============================================================
Part 1: Normal call to Claude Haiku
============================================================
Text: Paris
Model: claude-haiku-4-5-20251001
Input tokens: 42
Output tokens: 1
Cache read: 0
Cache write: 0
Latency: 487ms
Cost: $0.00003760
Manual cost check. The prompt was "What is the capital of France? Answer in one word." with system prompt "You are a helpful assistant.", 42 input tokens and 1 output token on Haiku:

(42 × $0.80/M) + (1 × $4.00/M) = $0.0000336 + $0.0000040 = $0.0000376
The 42 input tokens include:
- System prompt ("You are a helpful assistant.", ~6 tokens)
- API formatting overhead tokens the model adds internally
- User message ("What is the capital of France? Answer in one word.", ~14 tokens)
- Structured message formatting tokens
The API exposes only the total, not the breakdown. Exercise 01 measures token counts empirically across various input sizes.
6. Observe It#
Run the call 10 times and watch the latency column.
The first call is slower. DNS, TCP, warm-up. Haiku latency typically ranges 300ms-1500ms depending on output length and server load.
Cost per call is tiny on Haiku. A simple Q&A costs ~$0.00004. At 1M calls/day, that's $40/day, $14,600/year. The Haiku/Sonnet/Opus cost ratio (~1:4:20) is the foundation for the Pareto frontier analysis in Chapter 05.
Output token count varies. Even for "Paris", the model may return "Paris." or "Paris\n", different token counts. Set max_tokens conservatively for billing predictability.
graph TD
subgraph "What counts as input tokens?"
A["System prompt text"]
B["API formatting overhead"]
C["User message text"]
D["Conversation history"]
E["Total input_tokens in usage"]
A --> E
B --> E
C --> E
D --> E
end
subgraph "Your bill"
H["input_tokens × $0.80/M"]
I["output_tokens × $4.00/M"]
J["Total cost_usd"]
E --> H
I --> J
H --> J
end
War Story: The $0.00004 That Adds Up
A team building a customer support bot found they were making six LLM calls per interaction: classify intent, extract entities, retrieve context, draft response, check safety, reformat. All on Sonnet. Each interaction cost ~$0.003.
Cheap, until you have 100,000 interactions/day. $300/day. $109,500/year. Rerouting the classify and safety calls (which need less capability) to Haiku cut costs 40% with no measurable quality drop.
Cost optimization starts with the per-call mental model, not the aggregate bill. Build the habit of thinking "$X per call at Y calls/day" before you hit scale.
7. Break It#
Try calling an OpenAI model against the Anthropic endpoint:
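A sketch of the doomed request, construction only (nothing is sent here; the key placeholder is illustrative):

```python
# An OpenAI model name in an Anthropic-shaped body, aimed at the
# Anthropic endpoint. Sending this yields a 400 Bad Request.
url = "https://api.anthropic.com/v1/messages"
body = {
    "model": "gpt-4o",  # Anthropic's API has no such model
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello"}],
}
headers = {
    "x-api-key": "...",              # OpenAI would expect Authorization: Bearer
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}
```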
The API returns 400 Bad Request. response.raise_for_status() converts this to an httpx.HTTPStatusError. Your Python is correct; four things are wrong:
- Wrong URL: GPT-4o lives at `https://api.openai.com/v1/chat/completions`, not `/v1/messages`.
- Wrong auth: OpenAI uses `Authorization: Bearer sk-...`, not `x-api-key`.
- Wrong request shape: OpenAI's body differs from Anthropic's (no top-level `system` field).
- Wrong response shape: OpenAI returns `choices[0].message.content`, not `content[0].text`.
The raw HTTP code works for exactly one provider. Chapter 02 solves this: a call_llm() function that normalizes all four differences behind a stable interface, detecting the provider from the model name.
8. Error Handling and Production Readiness#
HTTP error codes you'll encounter#
- 400 Bad Request: malformed body or invalid field. Wrong model name, empty messages list, `max_tokens` too high. Inspect the error body.
- 401 Unauthorized: missing or invalid API key. Revoked keys also return 401, not 403.
- 403 Forbidden: valid key lacks permission. Usually org-level features.
- 422 Unprocessable Entity: syntactically valid JSON but semantically invalid. A `messages` list starting with `assistant` returns this; user/assistant alternation is required.
- 429 Too Many Requests: rate limit exceeded. Response includes `retry-after` in seconds. Retry with exponential backoff.
- 500 Internal Server Error: retry with backoff. Log the `request-id` header.
- 529 Overloaded: Anthropic's custom "too much load" status. Retry with backoff.
The retry decision#
4xx errors (except 429) are your fault: fix the request, don't retry. 5xx errors and 429 are transient: retry with exponential backoff (1s, 2s, 4s, 8s), up to 4 attempts, then re-raise.
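The rule can be stated as a predicate plus a delay schedule; Exercise 02 wires these into an actual retry loop. A sketch:

```python
def should_retry(status_code: int) -> bool:
    # 429 and all 5xx (including Anthropic's 529) are transient; other 4xx
    # are caller errors that retrying will never fix.
    return status_code == 429 or status_code >= 500

def backoff_delays(attempts: int = 4, base: float = 1.0) -> list[float]:
    # Exponential backoff: 1s, 2s, 4s, 8s for four attempts.
    return [base * (2 ** i) for i in range(attempts)]
```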
Error bodies have this structure:
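The shape (the `message` string here is illustrative):

```json
{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of requests has exceeded your rate limit."
  }
}
```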
error.type categorizes: invalid_request_error (400/422), authentication_error (401), permission_error (403), rate_limit_error (429), api_error (500). Via the SDK these become typed exceptions (anthropic.BadRequestError, etc.).
9. Exercises#
Exercise 01: Token Counter (exercises/01_token_counter.py)#
Call call_claude() with 10 prompts of increasing length. Record usage.input_tokens and prompt character length; compute chars_per_token for each. Expected: English prose 3.5-4.5 chars/token; code has more tokens per character; very short prompts have higher overhead from fixed API formatting tokens.
Exercise 02: Retry Budget (exercises/02_retry_budget.py)#
Implement call_with_retry() around call_claude() that retries on 429 and 5xx with exponential backoff (1s, 2s, 4s, then re-raise). Test with MockHTTPClient, a fake client that fails N times before succeeding.
Exercise 03: Multi-Model Comparison (exercises/03_multi_model.py)#
Implement compare_models(prompt) -> dict[str, CallResult] that calls Haiku and Sonnet with the same prompt. Sonnet costs ~4x more for nearly the same answer on simple prompts. Whether that's worth it depends on your quality metric (Chapter 05 formalizes this).
10. Summary#
Key takeaways:
- Every Anthropic SDK call is one HTTP POST to `https://api.anthropic.com/v1/messages`. JSON in, JSON out, three required headers.
- Tokens are the unit of billing. Input tokens include system prompt, conversation history, and API formatting overhead.
- The `usage` field has `input_tokens`, `output_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`. Track all four.
- Anthropic's tokenizer differs from tiktoken. Use `/v1/messages/count_tokens` for exact counts.
- The `anthropic-version` header pins the API contract. Always set it.
- Three design principles: Simplicity, Transparency, ACI. Each chapter earns its complexity by demonstrating a failure of the simpler approach.
- The agent loop won't change. Model names will. Build so model names are configuration, not code.
- Three-layer version convention: prose (capability tiers), footnotes (current example), code (pinned strings).
1. For an overview of how popular agent frameworks (LangChain, LangGraph, CrewAI, AutoGen, Agno) build on top of this primitive, see Appendix A. ↩
2. The `anthropic-version` value `"2023-06-01"` was current as of publication. Check https://docs.anthropic.com/en/api/versioning for the latest stable version. Existing callers specifying an older version header continue to work; that's the point of versioning. ↩
3. Anthropic's tokenizer is proprietary. Use `/v1/messages/count_tokens` for exact counts. The "1 token ≈ 4 characters" rule can deviate significantly for code, non-Latin scripts, and very short strings. ↩
4. Model IDs and prices as of April 2026. Token prices have trended down ~10x every 2 years. Tier ratios (small ≈ 4-5x cheaper than mid, mid ≈ 5x cheaper than premium) are more stable than the absolute numbers. ↩
5. Where this book says "a small fast model" in prose, the reference implementation uses Claude Haiku 4.5 (`claude-haiku-4-5-20251001`) as of 2026. GPT-4o-mini is the analogous tier on OpenAI's side. ↩