Chapter 07 — Cost-Aware Routing and Compaction¶
Companion to book/ch07_*.md. Runs top-to-bottom in Google Colab in mock mode with no API key required.
import os
if not os.path.exists("crafting-agentic-swarms"):
    !git clone https://github.com/TheAiSingularity/crafting-agentic-swarms.git
%cd crafting-agentic-swarms
!pip install -e ".[dev]" --quiet
!pip install matplotlib plotly ipywidgets --quiet
import os
try:
    from google.colab import userdata
    os.environ["ANTHROPIC_API_KEY"] = userdata.get("ANTHROPIC_API_KEY")
    print("Using real API (key from Colab secrets).")
except Exception:  # not in Colab, or no ANTHROPIC_API_KEY secret configured
    os.environ.setdefault("SWARM_MOCK", "true")
    print("Running in mock mode (no API key needed).")
What you'll build here¶
- Classify a synthetic workload into SMALL, MEDIUM, LARGE tiers.
- Tweak a routing threshold with a live ipywidgets slider and watch cost change.
- Compare the 5 compaction strategies on a single fixture conversation.
- Read the total cost breakdown by tier on a plotly bar chart.
1. The routing problem¶
Most real workloads have a long tail of trivial questions and a short head of complex ones. Sending every query to the largest model is correct but expensive. A small classifier up front picks the cheapest model that can handle each task; at scale, the saved spend compounds quickly.
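To make the idea concrete before introducing any LLM, here is a hypothetical zero-cost heuristic (not the TriageRouter used later in this notebook); the marker words and the eight-word cutoff are arbitrary assumptions chosen for illustration:

```python
# Hypothetical illustration only -- routes on surface features of the prompt.
def heuristic_tier(task: str) -> str:
    words = task.split()
    hard_markers = {"design", "architect", "analysis", "tradeoffs"}
    # Prompts mentioning open-ended design work go to the big model.
    if any(w.lower().strip(".,") in hard_markers for w in words):
        return "large"
    # Short direct questions are almost always cheap to answer.
    if len(words) <= 8 and task.endswith("?"):
        return "small"
    return "medium"  # everything else gets the middle tier

print(heuristic_tier("What is 2+2?"))                      # -> small
print(heuristic_tier("Design a multi-region event bus."))  # -> large
```

A heuristic like this is free but brittle; the LLM-based classifier loaded in section 3 trades a small per-call cost for much better judgment.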
2. Enable ipywidgets in Colab¶
Colab needs the custom widget manager enabled before any interactive widget renders. Locally this import raises, so we swallow the ImportError.
try:
    from google.colab import output
    output.enable_custom_widget_manager()
    print("Colab custom widget manager enabled.")
except ImportError:
    print("Not in Colab — skipping widget manager setup.")
3. Load the router¶
We use the TriageRouter from modules/09_routing_compaction/code/routing.py, which classifies each task into SMALL / MEDIUM / LARGE and assigns a Claude tier.
import sys
from pathlib import Path
sys.path.insert(0, str(Path("modules/09_routing_compaction/code")))
from routing import TriageRouter, ModelTier, TIER_MODELS, ContextCompactor, CompactionStrategy
router = TriageRouter()
print("TIER_MODELS:", {k.value: v for k, v in TIER_MODELS.items()})
4. Synthetic 100-task workload¶
Generate a realistic mix of complexities. We seed the distribution so results reproduce across runs. The weights (55 percent small, 35 percent medium, 10 percent large) match many SaaS support queues.
import random
random.seed(17)
SAMPLES = {
    "small": ["What is 2+2?", "Capital of France?", "Is 17 prime?", "Convert 5km to miles."],
    "medium": ["Summarise this paragraph: ...", "Rewrite this SQL to use a JOIN.",
               "Draft a function that dedupes a list while preserving order."],
    "large": ["Design a multi-region event bus with exactly-once semantics.",
              "Write a long-form analysis of the Pareto tradeoffs for our deployment."],
}

def sample_workload(n: int = 100) -> list[tuple[str, str]]:
    out: list[tuple[str, str]] = []
    for _ in range(n):
        tier = random.choices(["small", "medium", "large"], weights=[0.55, 0.35, 0.10])[0]
        out.append((tier, random.choice(SAMPLES[tier])))
    return out
workload = sample_workload(100)
from collections import Counter
print(Counter(t for t, _ in workload))
5. Estimated cost per tier¶
Published April-2026 input and output rates per million tokens. We assume ~500 tokens in and 500 tokens out per task as a representative load. Real prices and token counts will vary; treat these as first-order estimates, not an invoice.
RATES = {
    ModelTier.SMALL: {"in": 0.80, "out": 4.00},
    ModelTier.MEDIUM: {"in": 3.00, "out": 15.00},
    ModelTier.LARGE: {"in": 15.00, "out": 75.00},
}
TOKENS_IN = 500
TOKENS_OUT = 500

def tier_cost(tier: ModelTier) -> float:
    r = RATES[tier]
    return (TOKENS_IN * r["in"] + TOKENS_OUT * r["out"]) / 1_000_000

for tier in ModelTier:
    print(f"{tier.value:7s} ${tier_cost(tier):.4f} / call")
6. Interactive threshold sliders¶
Pretend we are hand-coding the router: send small_frac of the workload to SMALL, medium_frac to MEDIUM, and the remainder to LARGE. Watch the total cost respond live.
import ipywidgets as widgets
from IPython.display import display
import plotly.express as px
import pandas as pd
small_slider = widgets.FloatSlider(value=0.55, min=0.0, max=1.0, step=0.05,
                                   description="small frac")
medium_slider = widgets.FloatSlider(value=0.35, min=0.0, max=1.0, step=0.05,
                                    description="medium frac")
out_box = widgets.Output()

def render(_change=None):
    s = small_slider.value
    m = medium_slider.value
    l = max(0.0, 1.0 - s - m)
    n = 100
    counts = {ModelTier.SMALL: int(n * s),
              ModelTier.MEDIUM: int(n * m),
              ModelTier.LARGE: int(n * l)}
    totals = {tier.value: counts[tier] * tier_cost(tier) for tier in ModelTier}
    df = pd.DataFrame({"tier": list(totals.keys()), "cost_usd": list(totals.values())})
    out_box.clear_output(wait=True)
    with out_box:
        total = sum(totals.values())
        print(f"counts={ {k.value: v for k, v in counts.items()} } total=${total:.3f}")
        fig = px.bar(df, x="tier", y="cost_usd",
                     title=f"Workload cost (100 tasks): ${total:.3f}")
        fig.show()

small_slider.observe(render, names="value")
medium_slider.observe(render, names="value")
display(small_slider, medium_slider, out_box)
render()
7. Live routing with the actual TriageRouter¶
Run the real classifier on a subset of the workload. In mock mode it defaults every task to MEDIUM; with a live API key the classifier calls Haiku and you see a mix of tiers. That mix is the thing you actually want to measure when choosing thresholds.
sample = workload[:10]
classified = []
for _, task in sample:
    c = await router.classify(task)
    classified.append(c)
    print(f"{c.tier.value:7s} {task[:60]}")
8. Classified tier distribution¶
Count the tier assignments from the real classifier. In mock mode this is all MEDIUM; in live mode you can see whether your threshold heuristics line up with the LLM-based classifier.
tier_counts = Counter(c.tier.value for c in classified)
print(tier_counts)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(tier_counts.keys(), tier_counts.values(), color="#3d7eff")
ax.set_title("Real classifier tier distribution")
plt.show()
9. Compaction strategies¶
Long conversations exhaust the context window. We need a policy for what to keep when the budget runs low. ContextCompactor gives us five strategies ranging from brutal truncation to LLM summarisation.
compactor = ContextCompactor()
messages = [
    {"role": "system", "content": "You are a senior Python engineer."},
    {"role": "user", "content": "Set up a fresh project skeleton."},
    {"role": "assistant", "content": "Created pyproject.toml and src/."},
    {"role": "user", "content": "Add a CLI entry point."},
    {"role": "assistant", "content": "Added main.py with argparse."},
    {"role": "user", "content": "Wire up logging."},
    {"role": "assistant", "content": "Added structlog config in logging_setup.py."},
    {"role": "user", "content": "How do I deploy to prod?"},
    {"role": "assistant", "content": "Use GitHub Actions to build a docker image then push to ECR."},
    {"role": "user", "content": "What was the CLI name again?"},
    {"role": "assistant", "content": "It is `myproj`, defined in main.py."},
    {"role": "user", "content": "Write a dockerfile."},
]
print(f"Original: {len(messages)} messages, {sum(len(m['content']) for m in messages)} chars")
10. Run all five strategies¶
Every strategy always keeps the system message. The rest of the policy varies. truncate keeps the most recent bytes; rolling_window keeps the most recent N turns; selective keeps only turns matching keywords; index_retrieve scores turns against a query; summarize collapses older turns into one synthetic message.
results = {}
results["truncate(2000)"] = compactor.truncate(messages, max_chars=2000)
results["rolling(4)"] = compactor.rolling_window(messages, window=4)
results["selective"] = compactor.selective(messages, keep_keywords=["deploy", "CLI", "dockerfile"])
results["index(top3)"] = compactor.index_retrieve(messages, query="how to deploy the CLI", top_k=3)
results["summarise(3)"] = await compactor.summarize(messages, keep_last=3)
for name, msgs in results.items():
    chars = sum(len(m["content"]) for m in msgs)
    print(f"{name:16s} {len(msgs):2d} msgs {chars:5d} chars")
11. Visualise token savings¶
Bar chart shows each strategy's final context size vs the original. summarise gives the best compression per unit of lost fidelity; truncate is the cheapest but drops earliest context; selective needs you to predict the right keywords up front.
labels = list(results.keys()) + ["original"]
sizes = [sum(len(m['content']) for m in msgs) for msgs in results.values()]
sizes.append(sum(len(m['content']) for m in messages))
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(labels, sizes, color=["#3d7eff"] * len(results) + ["#888"])
ax.set_ylabel("Characters")
ax.set_title("Compaction strategies — size of kept context")
ax.tick_params(axis="x", rotation=30)
plt.tight_layout()
plt.show()
12. Inspect one compacted output¶
Print the summarised conversation. In mock mode we get a canned '[Summary]' placeholder; in live mode the LLM produces an actual summary preserving named entities, decisions, and open questions.
print("=== summarise(3) output ===")
for m in results["summarise(3)"]:
    print(f"[{m['role']}] {m['content'][:100]}")
13. Cost breakdown by tier (plotly)¶
Roll up the entire 100-task workload assuming the original tier mix, then plot cost share per tier. This is the dashboard you want on the wall when routing decisions land.
all_costs = {"small": 0.0, "medium": 0.0, "large": 0.0}
for tier_name, _ in workload:
    all_costs[tier_name] += tier_cost(ModelTier[tier_name.upper()])
df = pd.DataFrame({"tier": list(all_costs.keys()), "cost_usd": list(all_costs.values())})
fig = px.bar(df, x="tier", y="cost_usd", color="tier",
title=f"100-task workload cost by tier: ${df['cost_usd'].sum():.3f}")
fig.show()
14. What-if: all LARGE vs routed¶
How much does routing actually save? Compare the routed total against sending every single task to the LARGE tier. The gap is the money the triage router earns every month.
all_large = len(workload) * tier_cost(ModelTier.LARGE)
routed = df["cost_usd"].sum()
savings = all_large - routed
print(f"all LARGE: ${all_large:.3f}")
print(f"routed: ${routed:.3f}")
print(f"savings: ${savings:.3f} ({savings / all_large * 100:.1f}%)")
15. What to try next¶
- Add a real API key; rerun cell 7 and see the true classifier distribution.
- Tweak the SMALL/MEDIUM/LARGE boundaries: can you cut another 20 percent without losing quality?
- Replace index_retrieve with a real embedding search via sentence-transformers.
- Wire the compactor into a live ReAct loop and cap the context at 8k tokens.
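As a starting point for the embedding-search exercise, here is a stdlib stand-in that keeps the same retrieve-top-k shape: bag-of-words vectors and cosine similarity instead of real embeddings. The `vectorize`/`cosine`/`retrieve` names are hypothetical, not part of the repo; the intent is that you later swap `vectorize` for a sentence-transformers encoder and keep the rest.

```python
# Stdlib stand-in for embedding retrieval: bag-of-words cosine similarity.
# Replacing vectorize() with a real sentence-transformers encoder upgrades
# this sketch to the embedding search suggested above.
from collections import Counter
from math import sqrt

def vectorize(text: str) -> Counter:
    # Crude "embedding": token counts over a lowercased whitespace split.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(messages: list[dict], query: str, top_k: int = 3) -> list[dict]:
    # Score every message against the query, keep the top_k best matches.
    q = vectorize(query)
    scored = sorted(messages, key=lambda m: cosine(q, vectorize(m["content"])),
                    reverse=True)
    return scored[:top_k]

msgs = [{"role": "user", "content": "Wire up logging."},
        {"role": "user", "content": "How do I deploy to prod?"},
        {"role": "user", "content": "Write a dockerfile."}]
print(retrieve(msgs, "how to deploy", top_k=1)[0]["content"])  # -> How do I deploy to prod?
```

Bag-of-words cosine misses paraphrases ("ship to production" scores zero against "deploy"), which is exactly the gap real embeddings close.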