Appendix: Routing Research
The router is the unsexiest cost lever and the most valuable. Send every task to Opus and you pay 20x what you need to. Send every task to Haiku and you fail on the ones that needed Opus. Given a task, which model runs it? This appendix surveys the literature and says when each approach is worth its engineering cost.
Three approaches
Heuristic routers. swarm/routing/router.py takes this path. A triage LLM reads the task, returns JSON, and code maps it to a tier. Pros: no training data, readable rules, one file. Cons: every decision costs a small-model call, and rules drift the moment your task mix shifts. The right starting point, and for most teams the only router they will ever need.
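The shape of a triage-based router can be sketched in a few lines. Everything below is illustrative, not the contents of router.py: the tier names, the JSON schema, and `triage_llm` (here a stub standing in for the small-model call) are all assumptions.

```python
import json

# Hypothetical tier names -- substitute your own model identifiers.
TIERS = {"simple": "haiku", "standard": "sonnet", "complex": "opus"}

def triage_llm(task: str) -> str:
    """Stub for the triage model. In production this is an LLM call
    returning JSON like {"complexity": "standard"}; here it is a
    trivial heuristic so the sketch runs."""
    if len(task) < 80 and "?" in task:
        return json.dumps({"complexity": "simple"})
    return json.dumps({"complexity": "standard"})

def route(task: str) -> str:
    """Parse the triage JSON and map it to a tier, falling back to
    the middle tier on malformed or unexpected output."""
    try:
        complexity = json.loads(triage_llm(task)).get("complexity", "standard")
    except json.JSONDecodeError:
        complexity = "standard"
    return TIERS.get(complexity, TIERS["standard"])
```

The defensive default matters: a triage model will eventually emit malformed JSON, and the safe failure mode is the standard tier, not a crash.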
Learned routers. swarm/routing/learned_router.py shows the next step. Extract hand-engineered features (length, code-block count, imperative-verb count, question words) and train logistic regression on (task, best_model) pairs logged from production. Pros: zero inference cost per call, interpretable weights, one pickle file. Cons: needs labelled data, drifts silently under distribution shift, and the feature engineering caps the ceiling. LearnedRouter keeps the heuristic as a fallback on low-confidence predictions, so the learned ranker is upside and the heuristic is the floor.
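The feature extraction can be sketched without sklearn. The feature set below follows the list above (length, code-block count, imperative verbs, question words); the specific word lists are illustrative assumptions, not what learned_router.py ships.

```python
import re

# Illustrative vocabularies -- tune against your own task logs.
IMPERATIVE_VERBS = {"write", "fix", "refactor", "implement", "add", "remove", "debug"}
QUESTION_WORDS = {"what", "why", "how", "when", "where", "which", "who"}

def extract_features(task: str) -> list[float]:
    """Hand-engineered features of the kind described above: length,
    fenced-code-block count, imperative-verb count, question-word count."""
    words = re.findall(r"[a-z']+", task.lower())
    return [
        float(len(task)),
        float(task.count("```") // 2),  # paired fences = one code block
        float(sum(w in IMPERATIVE_VERBS for w in words)),
        float(sum(w in QUESTION_WORDS for w in words)),
    ]
```

These vectors are what you would feed to `sklearn.linear_model.LogisticRegression.fit` against the logged (task, best_model) labels; the ceiling of the whole approach is exactly the expressiveness of this function.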
RL routers. A policy observes routing decisions and downstream outcomes (eval score, cost, latency), then updates. FrugalGPT (Chen 2023) is the baseline; 2024-2025 work from LMSys and Stanford adds preference learning. The cost is real: a training loop, a reward model you trust, and enough traffic to learn from. Out of scope here but worth tracking.
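The stateless core of the idea fits in a toy epsilon-greedy bandit: one arm per model, a scalar reward observed after the fact (for example, eval score minus a cost penalty). This is a sketch of the learning loop only, far short of FrugalGPT or a contextual policy, which would also condition on task features.

```python
import random

class BanditRouter:
    """Toy epsilon-greedy bandit over models. Illustrative only:
    real RL routers condition on the task; this one cannot."""

    def __init__(self, models, epsilon=0.1, seed=0):
        self.models = list(models)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {m: 0 for m in self.models}
        self.values = {m: 0.0 for m in self.models}

    def choose(self):
        # Explore with probability epsilon, else exploit the best arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self.values[m])

    def update(self, model, reward):
        # Incremental mean of observed rewards for this arm.
        self.counts[model] += 1
        self.values[model] += (reward - self.values[model]) / self.counts[model]
```

Even this toy exposes the real costs named above: you need a reward you trust (the `update` argument) and enough traffic for the running means to converge.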
What the papers say
Mixture of Agents (Wang 2024) is about aggregation, not routing, but it validates the claim that small specialized models can match or exceed one large model on many tasks if combined right. FrugalGPT (Chen 2023) is the classic cost-reduction paper: sequential cascade from cheap to expensive with a quality gate between stages. The ideas map cleanly onto a router that tries Haiku first and escalates to Sonnet when confidence is low.
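The cascade idea reduces to a short loop. A sketch under stated assumptions: each stage is a callable returning (answer, confidence), the stage functions here are stubs, and the threshold is a knob rather than a calibrated probability (see the calibration gotcha below).

```python
def cascade(task, stages, threshold=0.8):
    """FrugalGPT-style cascade: run stages cheapest-first, return at
    the first answer whose confidence clears the quality gate. If no
    stage clears it, the last (most expensive) stage's answer wins."""
    answer, confidence = None, 0.0
    for model in stages:
        answer, confidence = model(task)
        if confidence >= threshold:
            return answer, model.__name__
    return answer, stages[-1].__name__

# Stub stages standing in for real model calls.
def haiku(task):
    return "quick answer", 0.55

def sonnet(task):
    return "careful answer", 0.9
```

Note the cost asymmetry this buys: tasks Haiku can handle never touch Sonnet, and the worst case (every stage runs) is bounded by the depth of the cascade.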
When to train a router
Train a learned router when you have 500 or more logged (task, best_model) pairs, where "best model" is backed by a real metric (eval score, human rating, downstream success), not just the one someone picked. Below that, noise dominates signal. Above that, you start winning the moment you retire the triage LLM call from the hot path.
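The rule of thumb can be made mechanical. The 500-pair floor comes from the text; the per-class floor is an illustrative addition (a hypothetical number, not from the source) that guards against the imbalance problem described below.

```python
from collections import Counter

def ready_to_train(labels, min_total=500, min_per_class=30):
    """Gate on the 500-pair rule of thumb, plus an assumed per-tier
    floor so tail classes aren't pure noise. `labels` is the
    best_model column of your logged (task, best_model) pairs."""
    counts = Counter(labels)
    return len(labels) >= min_total and all(
        c >= min_per_class for c in counts.values()
    )
```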
Gotchas
Training-data drift. The 500-example set from Q1 is a bad guide to Q3 if a new tier launched mid-quarter. Retrain on rolling windows (last 90 days) and re-evaluate monthly. LearnedRouter's save/load is cheap on purpose.
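The rolling-window filter is a one-liner worth writing down. A minimal sketch, assuming examples are stored as (timestamp, task, label) tuples; adapt to whatever your logging schema actually is.

```python
from datetime import datetime, timedelta

def rolling_window(examples, now, days=90):
    """Keep only rows from the last `days` days. `examples` is assumed
    to be (timestamp, task, label) tuples with datetime timestamps."""
    cutoff = now - timedelta(days=days)
    return [e for e in examples if e[0] >= cutoff]
```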
Class imbalance. Medium-tier routes dominate: 70-80 percent of tasks are "just use the standard model." Default logistic regression handles this poorly at the tail. Use class_weight="balanced" or oversample the minority before fit(). Otherwise your router routes every task to the dominant class.
Eval pollution. Never train on a task that also appears in your eval set. Obvious in theory, violated in practice every single time. Maintain a holdout of production-like queries that is never part of training data.
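A guard this cheap should run in CI. A minimal sketch using exact-string matching; near-duplicates (paraphrases, whitespace variants) need fuzzier checks, which this deliberately does not attempt.

```python
def assert_disjoint(train_tasks, eval_tasks):
    """Fail fast if any eval task leaked into training data.
    Exact-string matching only -- a floor, not a ceiling."""
    leaked = set(train_tasks) & set(eval_tasks)
    if leaked:
        raise ValueError(f"{len(leaked)} eval tasks found in training data")
```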
Confidence calibration. Raw predict_proba is not well calibrated on small data. A "0.72 confidence" may be right 55 percent of the time. If the threshold matters (it does, because it gates the fallback), run Platt scaling or isotonic regression. Until then, treat it as a knob, not a measurement.
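Before reaching for Platt scaling, measure the miscalibration. A sketch of a decile reliability table: feed it (confidence, was_correct) pairs from your holdout and compare mean confidence to observed accuracy per bucket. Large gaps are exactly the "0.72 means 55 percent" failure described above.

```python
def reliability(preds):
    """Bucket (confidence, was_correct) pairs into deciles and report
    (mean confidence, observed accuracy) per bucket. If the two
    diverge, predict_proba is lying and the fallback threshold
    needs recalibrating before you trust it."""
    buckets = {}
    for conf, correct in preds:
        b = min(int(conf * 10), 9)  # deciles 0.0-0.1 ... 0.9-1.0
        buckets.setdefault(b, []).append((conf, correct))
    return {
        b: (
            sum(c for c, _ in rows) / len(rows),   # mean confidence
            sum(ok for _, ok in rows) / len(rows), # observed accuracy
        )
        for b, rows in sorted(buckets.items())
    }
```

Once the table shows a real gap, sklearn's `CalibratedClassifierCV` wraps both Platt scaling (`method="sigmoid"`) and isotonic regression (`method="isotonic"`) over the same fitted model.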