Chapter 1 - Token Mechanics¶
Companion to book/ch01_raw_call.md. Runs top-to-bottom in Google Colab in mock mode with no API key required.
# Clone the repo (skip if already present - Colab keeps files across runs in one session)
import os
if not os.path.exists("crafting-agentic-swarms"):
    !git clone https://github.com/TheAiSingularity/crafting-agentic-swarms.git
%cd crafting-agentic-swarms
!pip install -e ".[dev]" --quiet
!pip install matplotlib plotly ipywidgets --quiet
import os
try:
    from google.colab import userdata
    os.environ["ANTHROPIC_API_KEY"] = userdata.get("ANTHROPIC_API_KEY")
    print("Using real API (key from Colab secrets).")
except Exception:  # no Colab, no secret, or secret unset - fall back to mock
    os.environ.setdefault("SWARM_MOCK", "true")
    print("Running in mock mode (no API key needed).")
What you'll build here¶
- Measure real input-token counts across five prompt shapes (question, code, long prose, poetry, non-English) and see the chars-per-token ratio shift.
- Compute running dollar cost across ten simulated calls using Haiku pricing, directly from swarm.core.models.compute_cost.
- Visualize the retry timeline baked into swarm.core.client._with_retry so you know what happens after a 429.
- Use mock mode for zero-cost reproducibility; swap in your real API key only when you want to.
1. The five prompts¶
Tokens are not characters. The tokenizer splits based on statistical frequency in its training corpus, so the same number of characters can cost wildly different token counts depending on the input. We'll measure this.
import asyncio
from swarm.core.client import call_agent
PROMPTS = {
"question": "What is the capital of France?",
"code": "def fib(n):\n return n if n < 2 else fib(n-1) + fib(n-2)",
"prose": ("Tokens are the unit of billing. " * 30).strip(),
"poetry": "Two roads diverged in a yellow wood,\nAnd sorry I could not travel both\nAnd be one traveler, long I stood.",
"non_english": "東京は日本の首都です。人口は約 1400万人です。", # Tokyo is Japan's capital.
}
async def measure(prompts):
    rows = []
    for label, text in prompts.items():
        _, record = await call_agent(
            agent_id=f"probe_{label}",
            role="general",
            task_id="ch01",
            system="You are a helpful assistant.",
            prompt=text,
            model="claude-haiku-4-5-20251001",
            max_tokens=64,
        )
        rows.append({
            "label": label,
            "chars": len(text),
            "tokens": record.input_tokens,
            "ratio": len(text) / max(record.input_tokens, 1),
        })
    return rows
rows = await measure(PROMPTS)
for r in rows:
    print(f"{r['label']:12s} chars={r['chars']:4d} tokens={r['tokens']:4d} chars/tok={r['ratio']:.2f}")
Mock mode produces synthetic token counts (input_tokens comes from the fixture, not a tokenizer), so the ratio comparison is illustrative rather than literal. In real-API mode the same code reports the tokenizer's actual counts.
2. Chars per token - visual¶
import matplotlib.pyplot as plt
labels = [r["label"] for r in rows]
ratios = [r["ratio"] for r in rows]
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(labels, ratios, color=["#3b82f6", "#10b981", "#f59e0b", "#ef4444", "#8b5cf6"])
ax.set_ylabel("chars / token")
ax.set_title("Tokenizer efficiency across prompt shapes")
ax.axhline(4.0, color="#6b7280", linestyle="--", linewidth=1, label="4.0 English baseline")
ax.legend(loc="upper right")
for bar, val in zip(bars, ratios):
    ax.text(bar.get_x() + bar.get_width()/2, val, f"{val:.2f}", ha="center", va="bottom")
plt.tight_layout()
plt.show()
The baseline from Anthropic's guide - roughly 4 characters per token - holds for English prose. Code and non-English text drop that ratio (fewer characters per token), which means the same character count costs more tokens, and therefore more dollars, when your input is not English prose.
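To make that concrete, here is a back-of-envelope comparison of what a fixed-size document costs at two tokenizer ratios. The $1 per million input tokens price is an assumed figure for illustration only; use compute_cost with the repo's actual pricing table for real numbers.

```python
# Back-of-envelope: cost of a 10,000-character document at two chars/token ratios.
INPUT_PRICE_PER_MTOK = 1.00  # USD per million input tokens - assumed, for illustration

def doc_cost(chars: int, chars_per_token: float) -> float:
    tokens = chars / chars_per_token
    return tokens * INPUT_PRICE_PER_MTOK / 1_000_000

english = doc_cost(10_000, 4.0)  # English prose baseline
dense = doc_cost(10_000, 2.5)    # code / non-English: denser tokenization
print(f"English prose: ${english:.6f}")
print(f"Dense input:   ${dense:.6f} ({dense / english:.2f}x)")
```

Same characters, 1.6x the bill - the ratio shift from the bar chart above translates directly into dollars.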
3. Cost growth across 10 calls¶
from swarm.core.models import compute_cost
from swarm.core.records import CallRecord
# Simulate a session of 10 calls, each averaging 600 input tokens + 400 output tokens.
costs = []
running = 0.0
for i in range(10):
    rec = CallRecord(
        agent_id="probe",
        role="worker",
        model="claude-haiku-4-5-20251001",
        task_id=f"turn_{i}",
        input_tokens=600,
        output_tokens=400,
    )
    per_call = compute_cost(rec)
    running += per_call
    costs.append(running)
print(f"Per-call cost (Haiku): ${per_call:.6f}")
print(f"Total after 10 calls: ${running:.6f}")
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(range(1, 11), costs, marker="o", linewidth=2, color="#ef4444")
ax.fill_between(range(1, 11), costs, alpha=0.1, color="#ef4444")
ax.set_xlabel("call #")
ax.set_ylabel("cumulative cost (USD)")
ax.set_title("Running cost across a 10-call session (Haiku)")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Cost growth is linear here because we set the per-call token mix to a constant. In a real ReAct loop (Chapter 3) the mix grows per-iteration as the message history gets re-sent on every call, and the line curves upward.
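You can see that curvature without waiting for Chapter 3. This sketch (not the repo's loop code) re-sends the growing message history on every iteration, using the same 600/400 token mix as above:

```python
# Sketch: cumulative input tokens when the full message history is re-sent
# each turn of a ReAct-style loop. Numbers match the simulation above.
base_prompt = 600      # static prompt tokens per call
per_turn_output = 400  # tokens appended to the history each turn

cumulative_input = []
history = 0
total = 0
for turn in range(10):
    input_tokens = base_prompt + history  # the whole history rides along every call
    total += input_tokens
    history += per_turn_output            # this turn's output joins the history
    cumulative_input.append(total)

print(cumulative_input)
```

The increments grow by 400 tokens each turn, so cumulative input tokens - and cost - grow quadratically with turn count, not linearly.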
4. Retry timeline¶
swarm.core.client._with_retry uses exponential backoff: immediate, then +1s, +2s, +4s, capped at 30s, plus up to 0.5s of jitter per retry. Here's the planned schedule for four attempts.
import random
from swarm.core.client import RETRY_MAX_ATTEMPTS, RETRY_BASE_DELAY, RETRY_MAX_DELAY
def retry_schedule(seed=0):
    random.seed(seed)
    t = 0.0
    points = [("attempt 1", 0.0)]
    for attempt in range(RETRY_MAX_ATTEMPTS - 1):
        delay = min(RETRY_BASE_DELAY * (2 ** attempt) + random.uniform(0, 0.5), RETRY_MAX_DELAY)
        t += delay
        points.append((f"attempt {attempt + 2}", t))
    return points
points = retry_schedule(seed=42)
for name, ts in points:
    print(f"{name}: t = {ts:.2f}s")
fig, ax = plt.subplots(figsize=(8, 2.5))
times = [p[1] for p in points]
labels = [p[0] for p in points]
ax.scatter(times, [0] * len(times), s=200, color="#3b82f6", zorder=3)
for t, label in zip(times, labels):
    ax.annotate(label, (t, 0), xytext=(0, 15), textcoords="offset points", ha="center")
ax.hlines(0, 0, max(times) * 1.05, colors="#9ca3af", linewidth=1, zorder=1)
ax.set_yticks([])
ax.set_xlabel("elapsed time (s)")
ax.set_title("Retry timeline - exponential backoff with jitter")
ax.set_ylim(-1, 1)
plt.tight_layout()
plt.show()
The first attempt fires immediately. If it throws a retryable error (429, 5xx, timeout), we wait ~1s, then ~2s, then ~4s, each plus up to 0.5s of jitter. By the fourth attempt we have spent roughly seven seconds if every try fails. That budget matters when you are deciding where to spend your retry allowance: client-side here, or upstream at the load balancer.
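The worst-case version of that budget is easy to pin down, assuming the same schedule as above (1s base delay doubling each retry, 30s cap, up to 0.5s jitter per retry):

```python
# Worst-case client-side wait before the final attempt: every retry hits
# the maximum jitter. Constants mirror the schedule described above.
BASE, CAP, JITTER, ATTEMPTS = 1.0, 30.0, 0.5, 4

worst_case = sum(min(BASE * 2 ** k, CAP) + JITTER for k in range(ATTEMPTS - 1))
print(f"Worst-case wait before attempt {ATTEMPTS}: {worst_case:.1f}s")  # 1.5 + 2.5 + 4.5 = 8.5s
```

Any upstream timeout shorter than that worst case will cut off your final retry, so size the upstream budget against 8.5s, not the ~7s typical case.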
5. Real-API gate (off by default)¶
If you want to run this against the real API, set ANTHROPIC_API_KEY in Colab Secrets and re-run from the top. The block below only fires when mock mode is off.
if os.environ.get("SWARM_MOCK") != "true":
    _, real_record = await call_agent(
        agent_id="real_probe",
        role="general",
        task_id="ch01",
        system="You are a helpful assistant.",
        prompt="What is the capital of France? Answer in one word.",
        model="claude-haiku-4-5-20251001",
        max_tokens=16,
    )
    print(f"Real call: {real_record.input_tokens} in, {real_record.output_tokens} out, ${real_record.cost_usd:.8f}")
else:
    print("Skipped (mock mode).")
6. Interactive: latency at 99th percentile¶
Production retry budgets get sized around the 99th-percentile latency. Below, we simulate 1000 calls whose base latency is log-normally distributed and compare the P50, P95, and P99 values. The retry budget above (~7s) must sit above the P99 so your retries reliably catch tail failures without firing on calls that would have succeeded anyway.
import numpy as np
rng = np.random.default_rng(seed=0)
samples = rng.lognormal(mean=6.0, sigma=0.5, size=1000) # ms
print(f"P50={np.percentile(samples, 50):.0f}ms P95={np.percentile(samples, 95):.0f}ms P99={np.percentile(samples, 99):.0f}ms")
fig, ax = plt.subplots(figsize=(8, 3.5))
ax.hist(samples, bins=40, color="#6366f1", alpha=0.85)
for pct, color in [(50, "#10b981"), (95, "#f59e0b"), (99, "#ef4444")]:
    x = np.percentile(samples, pct)
    ax.axvline(x, color=color, linestyle="--", label=f"P{pct}={x:.0f}ms")
ax.set_xlabel("latency (ms)")
ax.set_ylabel("count")
ax.set_title("Simulated latency distribution, 1000 calls")
ax.legend()
plt.tight_layout()
plt.show()
7. Token cost of static documents¶
Prompt caching is a cost lever for the part of the prompt that doesn't change. How much static content is worth caching? We plot the with-cache/no-cache cost ratio against the number of repeated calls for several document sizes; the break-even point is where each curve crosses 1.0.
from swarm.core.models import MODEL_PRICING as MP
p = MP["claude-sonnet-4-6"]
doc_sizes = [500, 1500, 3000, 6000, 12_000]
repeats = np.arange(1, 11)
fig, ax = plt.subplots(figsize=(9, 4))
for size in doc_sizes:
    no_cache = size * p["input"] * repeats / 1_000_000
    with_cache = (size * p["cache_write"] + (repeats - 1) * size * p["cache_read"]) / 1_000_000
    ax.plot(repeats, with_cache / no_cache, marker="o", label=f"{size} tok doc")
ax.axhline(1.0, color="#6b7280", linestyle="--", linewidth=1)
ax.set_xlabel("number of repeated calls")
ax.set_ylabel("with_cache / no_cache ratio")
ax.set_title("Caching break-even vs document size")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Every curve crosses 1.0 (break-even) at about 2 repeats, regardless of document size. The bigger the doc, the bigger the absolute dollar savings per repeat. Tiny prompts that are called once do not benefit from caching - the write premium dominates.
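The size-independence falls out of the algebra: document size multiplies both numerator and denominator of the ratio, so it cancels. Assuming the common multipliers of 1.25x the input price for a cache write and 0.1x for a cache read (check MODEL_PRICING for the values the repo actually uses), the ratio depends only on the repeat count:

```python
# Why break-even is independent of document size: size cancels out of the ratio.
# Multipliers are assumed (1.25x input for cache write, 0.1x for cache read).
WRITE_MULT, READ_MULT = 1.25, 0.10

def cost_ratio(repeats: int) -> float:
    # (one cached write + repeated cheap reads) / (re-sending the full doc every call)
    return (WRITE_MULT + (repeats - 1) * READ_MULT) / repeats

print(f"1 repeat:  {cost_ratio(1):.3f}")  # 1.250 - the write premium loses
print(f"2 repeats: {cost_ratio(2):.3f}")  # 0.675 - already well past break-even
```

A single call costs 25% more with caching; by the second call you are already ahead, which is exactly where the curves above cross 1.0.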
Takeaways¶
- Input tokens, not characters, are the billing unit. Non-English and code cost more per character.
- Cost is linear per call if the mix is fixed; loops change that (Chapter 3).
- Retries eat a known time budget - plan your timeouts upstream so retries don't cascade.
- Tail-percentile latency drives retry budget sizing.
- Prompt caching breaks even at 2 repeats for any document size.