Chapter 05: Evaluation & Observability#
In this chapter

- Why "APPROVE" is a soft signal and what rigorous measurement looks like
- How LLM-as-judge works, what biases it has, and how to mitigate them
- How to build an eval harness that runs on every commit, in parallel, in seconds
- What OpenTelemetry's GenAI attributes give you, and why they matter for vendor portability
- How to plot a Pareto frontier across models and pick the right one for your quality target
- Why observability is not just operational hygiene but an implementation of the Transparency principle
1. Motivation#
Your Chapter 04 critic said "APPROVE" after round 2. But was the code actually better? You don't know. You're trusting the model to evaluate itself. It may approve bad code because it looks plausible, give a 7/10 to one version and an 8/10 to an identical version in a different conversation, or bias toward closure after N rounds. The "APPROVE" is a soft signal, not a measurement.
This chapter builds the measurement layer: LLM-as-judge with bias awareness, an eval harness you can run every commit, and OpenTelemetry traces with GenAI semantic attributes (the gen_ai.* convention). After this chapter, "better" has a number.
2. First Principles#
What makes a good eval?#
Three properties:
- Validity: it measures what actually matters. An eval that checks whether output contains certain keywords may be measuring verbosity, not correctness.
- Reliability: the same input produces the same verdict (or close). An eval with high variance is noise. You can't detect a 2% quality improvement through 20% evaluation noise.
- Cost: cheap enough to run on every commit, model swap, prompt change. An eval you run once a week is too slow to be useful.
Human evaluators score highly on validity and reliability, but fail on cost. LLM judges score well on cost but have known biases. The practical answer: LLM judge for continuous evaluation, human for final validation.
Position bias#
Zheng et al. (2023, arXiv:2306.05685) quantified a systematic problem: GPT-4 prefers whichever answer appears first roughly 60% of the time. Swap the two answers and re-evaluate, and the preference often follows the position rather than the content.
This is position bias: the model has a prior for "the first answer is better" regardless of quality. The bias is consistent across models and task categories. An evaluator with position bias is measuring presentation order, not quality.
The mitigation: evaluate in both orderings (A vs B, then B vs A) and average the scores. If the winner flips, flag it: position_bias_detected=True. This doubles the cost of pairwise evaluation but converts an unreliable signal into a reliable one.
Verbosity bias#
LLMs also tend to prefer longer answers. A 500-word response to a yes/no question may score higher than a concise "yes" even when the concise answer is correct.
The mitigation is reference-backed scoring: compare the output to a gold standard rather than evaluating in isolation. A one-word "Paris" scores 1.0 against a reference of "Paris". A 200-word explanation scores lower on conciseness rubrics.
Pairwise vs scalar#
- Scalar scoring: give the output 0.0 to 1.0. Simple. Aggregatable. Comparable across runs.
- Pairwise preference: ask "which is better, A or B?" More reliable but can't be directly aggregated.
We use both. judge.score() for continuous eval runs (need numbers). judge.pairwise() for model comparison (need relative ranking).
Callout: The 0.7 threshold and domain calibration
The pass threshold is `score >= 0.7`, a reasonable starting point, not a universal law. It is domain-dependent:

- Code: 0.8 or higher. A false pass risks deploying broken software.
- Factual Q&A: 0.75 or higher. Factual errors compound when users act on them.
- Creative writing: 0.5 or lower. Quality is subjective.
- Summarization: 0.65 typical. Some paraphrase is acceptable.
Calibrate with Exercise 01: compare judge scores to your own manual scores on 10 cases. Compute Pearson correlation. Adjust the threshold until the judge's pass/fail decisions match yours 80%+ of the time.
3. The Intellectual Lineage#
MT-Bench and Chatbot Arena (Zheng et al., 2023)#
"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., 2023, arXiv:2306.05685) makes three contributions:
MT-Bench: a multi-turn benchmark with 80 questions across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities). LLM judges correlate with human judgment (Spearman 0.85) when biases are controlled.
Position bias quantification: evaluators prefer first-presented answers ~60% of the time regardless of quality. This motivates the double-evaluation strategy in judge.pairwise(). Without double evaluation, roughly 10% of your pairwise comparisons are decided by presentation order rather than quality.
Self-enhancement bias: LLMs rate outputs matching their own style more highly. Cross-provider judges show the same pattern.1 Mitigation: use a different model as judge than the models being evaluated, or use multiple judges and average.
The Elo Rating System for LLMs (Chatbot Arena)#
Chatbot Arena (Chiang et al., 2024, arXiv:2403.04132) extended pairwise evaluation to human preference at scale: users compare two anonymized model outputs and vote. The resulting Elo ratings, the same system used in chess, are the most widely trusted benchmark for conversational AI quality. Aggregated pairwise human preferences converge faster than rubric-based scoring because "which is better?" is an easier judgment than "rate this from 0 to 10."
For your eval harness: include pairwise alongside scalar. Pairwise is more reliable for closely matched models; scalar is easier to aggregate over time.
SWE-bench (Jimenez et al., 2023)#
"SWE-bench: Can Language Models Resolve Real-world GitHub Issues?" (Jimenez et al., 2023, arXiv:2310.06770) is the gold standard for code evaluation. It uses real GitHub issues with real test suites as the pass/fail oracle. No LLM judge. Tests pass or fail.
Whenever you can replace an LLM judge with a verifiable oracle (test suite, type checker, linter, schema validator), do so. Use LLM judges only for judgment calls that can't be automated.
LLM-as-Judge architecture#
graph TD
subgraph "Inputs"
OUT["Candidate Output"]
REF["Reference (optional)"]
RUB["Rubric (optional)"]
end
subgraph "LLMJudge.score()"
PROMPT["Build judge prompt"]
LLM["LLM call (different model)"]
PARSE["Parse score + reasoning"]
end
subgraph "Output"
RESULT["EvalResult\nscore, reasoning, pass"]
end
OUT --> PROMPT
REF -.->|optional| PROMPT
RUB -.->|optional| PROMPT
PROMPT --> LLM --> PARSE --> RESULT
Use a different model as judge than the models being evaluated. Cross-provider judging reduces self-enhancement bias.
4. Build It#
Open code/eval.py.
The EvalCase and EvalResult dataclasses#
@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str | None = None
    tags: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
`expected_output` is optional: for reference-backed scoring you have it; for rubric-based scoring you don't. `tags` let you slice: `[r for r in run.results if "math" in case_by_id[r.case_id].tags]`. Invest in tags from day one so you can slice by category as your suite grows.
The judge system prompt:
You are an expert evaluator. Score the AI response on a 1-10 scale.
Criteria:
- Correctness: Does the response accurately address the task?
- Completeness: Are all aspects covered?
- Clarity: Is the response well-structured?
Return ONLY: {"score": <int 1-10>, "rationale": "<one sentence>"}
LLMJudge.score()#
async def score(self, output, *, reference=None, rubric=""):
    if reference:
        prompt = f"Reference:\n{reference}\n\nCandidate:\n{output}\n\n..."
    else:
        prompt = f"Output:\n{output}\n\n{rubric}\n\n..."
    result = await self._call(prompt, system=_JUDGE_SYSTEM, max_tokens=100)
    return self._parse_score_response(result.text)
[full: swarm/eval/harness.py:80-140]
Always return reasoning. When a case fails, a bare score tells you nothing about whether the failure was a hallucination, format error, or factual mistake.
The max_tokens=100 cap is intentional. Judge outputs should be a score and one sentence. If you let the judge reason at length, you're paying for text the harness never reads.
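The parsing step deserves defensive care, because judges sometimes wrap the JSON in prose. One plausible implementation — the real one lives in swarm/eval/harness.py; this sketch assumes the JSON contract from the system prompt above and a simple score/10 normalization:

```python
import json
import re

def parse_score_response(text: str) -> tuple[float, str]:
    """Extract {"score": <int 1-10>, "rationale": ...}; normalize to 0.0-1.0."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # tolerate prose around the JSON
    if match is None:
        return 0.0, f"unparseable judge output: {text[:80]}"
    try:
        payload = json.loads(match.group(0))
        raw = int(payload["score"])
    except (ValueError, KeyError, TypeError):
        return 0.0, f"malformed judge output: {text[:80]}"
    raw = max(1, min(10, raw))  # clamp to the advertised 1-10 scale
    return raw / 10, str(payload.get("rationale", ""))

score, rationale = parse_score_response(
    'Sure! {"score": 8, "rationale": "accurate and concise"}')
# score == 0.8
```

Failing closed (score 0.0 with a diagnostic rationale) is the safer default: a malformed judge response should never count as a pass.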
LLMJudge.pairwise(): position bias mitigation#
async def pairwise(self, output_a, output_b, *, prompt) -> dict:
    # Forward: A first
    score_a_fwd, score_b_fwd = await self._compare_once(
        output_a, output_b, prompt=prompt)
    # Reversed: B first
    score_b_rev, score_a_rev = await self._compare_once(
        output_b, output_a, prompt=prompt)
    avg_a = (score_a_fwd + score_a_rev) / 2
    avg_b = (score_b_fwd + score_b_rev) / 2
    winner_fwd = "a" if score_a_fwd > score_b_fwd else "b"
    winner_rev = "a" if score_a_rev > score_b_rev else "b"
    position_bias_detected = winner_fwd != winner_rev
Two evaluations, one each ordering. If the winner flips, the judge is position-biased on this pair. Running both and averaging cancels the directional bias. position_bias_detected=True surfaces cases where the candidates are genuinely close and the judge is uncertain.
sequenceDiagram
participant Caller
participant Judge
Caller->>Judge: pairwise(A, B)
Note over Judge: Forward pass: A first
Judge-->>Caller: score_a_fwd=0.7, score_b_fwd=0.3
Caller->>Judge: pairwise(B, A)
Note over Judge: Reversed pass: B first
Judge-->>Caller: score_b_rev=0.6, score_a_rev=0.4
Note over Caller: avg_a=0.55, avg_b=0.45
Note over Caller: winner_fwd=A, winner_rev=B → position_bias_detected
Caller-->>Caller: flag for human review
EvalHarness.run() and .compare()#
async def run(self, model, system) -> EvalRun:
    results = list(await asyncio.gather(*[_run_one(c) for c in self.cases]))
All cases run concurrently. At 50 cases, this is the difference between a 5-second eval and a 50-second eval.
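One caveat the snippet glosses over: at larger case counts, an unbounded gather can trip provider rate limits. A bounded variant is a small change — sketched here with a semaphore (`run_bounded`, the stub runner, and the cap of 10 are illustrative, not part of the chapter's harness):

```python
import asyncio

async def run_bounded(cases, run_one, max_concurrency: int = 10):
    """Run eval cases concurrently, but never more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(case):
        async with sem:
            return await run_one(case)

    # gather preserves input order, so results line up with cases.
    return await asyncio.gather(*[guarded(c) for c in cases])

# Demo with a stub "case runner" standing in for a real LLM call:
async def fake_run_one(case_id):
    await asyncio.sleep(0.01)
    return {"case_id": case_id, "score": 0.8}

results = asyncio.run(run_bounded(["geo-01", "math-01", "code-01"], fake_run_one))
```

With the cap at or above the case count this behaves identically to plain `gather`; lower it only when the provider starts returning 429s.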
The .compare() method returns a regressions list: cases that passed before and now fail, or whose scores dropped by more than 0.1. The > 0.1 threshold is calibrated to judge variance (~0.05-0.08 on repeated evaluations of identical inputs). If you use a weaker judge with higher variance, raise it to 0.15.
The regressions list is where you catch silent failures. A model swap that improves average score but regresses on 3 specific cases may be unacceptable.
def compare(self, run_a, run_b) -> dict:
    regressions = [
        {"case_id": cid, "score_a": scores_a[cid], "score_b": scores_b[cid]}
        for cid in scores_a
        if cid in scores_b and scores_a[cid] - scores_b[cid] > 0.1
    ]
Observability: Measuring the Process#
The eval harness measures output quality. Observability measures process health: how long did it take, how many tokens, where did it fail? Both are required for production.
The Span and Tracer classes#
@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)

    def set(self, key, value) -> None: ...
    def finish(self) -> None: ...
[full: swarm/observability/tracer.py:20-80]
Self-contained, no OTel SDK required. Attribute names follow the OpenTelemetry Semantic Conventions for Generative AI:
- `gen_ai.system`: "anthropic" | "openai" | "litellm"
- `gen_ai.request.model`: the model ID
- `gen_ai.operation.name`: the agent role
- `gen_ai.usage.input_tokens`: input token count
- `gen_ai.usage.output_tokens`: output token count
- `gen_ai.usage.cost_usd`: USD cost (extension, not in the official spec yet)
Using canonical attribute names means you can swap in the real OTel SDK later and your traces will parse correctly in Grafana, Jaeger, Honeycomb, without changing instrumentation code.
Sidebar: Why OTel Semantic Conventions Matter

OpenTelemetry is a CNCF standard for distributed tracing. The "semantic conventions" are the agreed-upon attribute names for common operations. The GenAI conventions (`gen_ai.*`) were finalized in 2024 and are now supported by every major observability platform.

The practical benefit is vendor portability. Instrument your agent with `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and you can switch from Honeycomb to Datadog to Grafana without changing a line of instrumentation code. The collector changes; the spans stay the same.

The `gen_ai.usage.cost_usd` attribute in this chapter's code is an extension. The spec avoids encoding pricing (it's provider-specific). Add it as a custom attribute.
5. Transparency as Engineering#
Observability is the engineering implementation of the Transparency principle: agents should surface their planning and reasoning rather than acting as black boxes. Every LLM call produces a span with the attributes above. Surface this record: to operators via a dashboard, to users via a "thinking" panel, to yourself via the span log.
graph TB
subgraph "Agent System"
G[Generator] -->|span| T[(Tracer)]
C[Critic] -->|span| T
R[Refiner] -->|span| T
end
subgraph "Observability Stack"
T -->|OTLP| COL[OTel Collector]
COL --> GRAF[Grafana / Jaeger]
COL --> HON[Honeycomb / Datadog]
end
subgraph "User-Facing"
T -->|real-time| PANEL[Thinking Panel]
T -->|post-run| AUDIT[Audit Log]
end
Observability is a feature, not a debugging tool. Add spans before you ship. Users and operators need visibility to trust the system, not just developers trying to diagnose failures.
6. Run It#
Output (mock mode):
── EvalHarness ──────────────────────────────────────
Run ID: a3f2c8d1
Cases: 5
Passed: 4/5
Avg score: 0.7820
Case Score Pass Output
geo-01 0.8200 PASS Mock answer for: What is the cap
math-01 0.8200 PASS Mock answer for: What is 12 × 12
code-01 0.8200 PASS Mock answer for: Write a Python
In real mode, some cases will fail, especially code-01 where the model may return an explanation instead of the bare expression, dropping the score below 0.7.
7. Observe It#
The eval pipeline#
The eval harness is not a one-off script. It's a pipeline that runs on every meaningful change:
flowchart LR
CHANGE[Code / prompt / model change] --> RUN[harness.run]
RUN --> COMPARE[baseline vs candidate]
COMPARE --> REG{Regressions?}
REG -->|Yes| BLOCK[Block deploy]
REG -->|No| PARETO[Plot Pareto]
PARETO --> DECIDE[Pick model on frontier]
Running is cheap (<10s for 20 cases). A regression reaching production is not. Wire it into CI.
The Pareto frontier#
xychart-beta
title "Model Pareto Frontier: Quality vs Cost"
x-axis ["Haiku ($0.001)", "GPT-4o-mini ($0.002)", "Sonnet ($0.005)", "GPT-4o ($0.01)", "Opus ($0.02)"]
y-axis "Avg Eval Score" 0.70 --> 1.00
line [0.78, 0.82, 0.91, 0.89, 0.96]
harness.pareto_point(run) returns (avg_score, total_cost), one point on this chart. Run the same eval with Haiku, Sonnet, Opus. The efficient frontier is the curve where no point is dominated (higher score at lower or equal cost). Sonnet typically sits on it: meaningfully better than Haiku, much cheaper than Opus.
Don't default to the best model. Default to the cheapest model on the efficient frontier for your quality target.
Plotting takes only a few lines:

import matplotlib.pyplot as plt

runs = {"haiku": haiku_run, "sonnet": sonnet_run, "opus": opus_run}
for name, run in runs.items():
    score, cost = harness.pareto_point(run)
    plt.scatter(cost, score, label=name)
plt.xlabel("Total cost (USD)"); plt.ylabel("Avg score")
plt.legend(); plt.title("Pareto Frontier"); plt.show()
If Sonnet is northeast of Haiku and southwest of Opus, it's on the efficient frontier.
Callout: Real Pareto frontiers have ≥3 axes
The cost × accuracy plot is a simplification. Real production frontiers have at least three axes: - Cost: total USD per task - Accuracy: eval score - Latency: p95 response time in milliseconds
For some deployments, add more: compliance risk, privacy tier, data residency.
Multi-objective optimization across three or more axes has no single "Pareto optimal" point, only a surface. Practical approaches: Pareto ranking (rank by how many others dominate you); constraint satisfaction (fix latency and cost constraints, maximize accuracy within them); weighted sum (assign weights, compute a scalar; sensitive to weights).
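The dominance check behind Pareto ranking is a few lines of code. This sketch assumes lower cost and latency are better and higher score is better; the sample numbers reuse the scores from the chart above with invented latencies:

```python
def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least = (a["cost"] <= b["cost"] and a["latency"] <= b["latency"]
                and a["score"] >= b["score"])
    strictly = (a["cost"] < b["cost"] or a["latency"] < b["latency"]
                or a["score"] > b["score"])
    return at_least and strictly

def pareto_frontier(points: list[dict]) -> list[dict]:
    """Keep only the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

models = [
    {"name": "haiku",  "cost": 0.001, "latency": 400,  "score": 0.78},
    {"name": "sonnet", "cost": 0.005, "latency": 900,  "score": 0.91},
    {"name": "gpt-4o", "cost": 0.010, "latency": 1100, "score": 0.89},
    {"name": "opus",   "cost": 0.020, "latency": 1600, "score": 0.96},
]
frontier = [p["name"] for p in pareto_frontier(models)]
# gpt-4o is dominated by sonnet (cheaper, faster, higher score); the rest survive.
```

This is the constraint-satisfaction approach's first step too: filter to the frontier, then apply your latency and cost caps to what remains.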
8. The Observability Stack#
Each LLM call is one span:
span = tracer.start_span("generate")
span.set("gen_ai.system", "anthropic")
span.set("gen_ai.request.model", model)
span.set("gen_ai.operation.name", "generate")
span.set("gen_ai.usage.input_tokens", result.usage.input_tokens)
span.set("gen_ai.usage.output_tokens", result.usage.output_tokens)
span.finish()
In production, spans nest: the outer span covers the entire refinement loop; child spans cover each generator, critic, and refiner call. The nesting makes latency analysis possible without parsing log timestamps.
Instrument at two levels:

- Every LLM call: `gen_ai.*` attributes, latency, error.
- Every logical operation: generator, critic, refiner, eval case, handoff.
Level 1 gives you cost and token tracking. Level 2 gives operational insight: which agents are slowest, which tasks cause critics to take multiple rounds, which eval cases consistently fail. Don't instrument finer unless a specific question requires it. Over-instrumentation floods the backend.
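Nesting is easiest to get right with a context manager. The sketch below is an assumption-laden toy — it records only (depth, name) pairs rather than full spans, and the `span()` API differs from the chapter's `start_span`/`finish` — but it shows how nesting survives into the trace without timestamp parsing:

```python
from contextlib import contextmanager

class NestingTracer:
    """Minimal tracer that records (depth, name) so nesting survives into the log."""
    def __init__(self):
        self.log: list[tuple[int, str]] = []
        self._depth = 0

    @contextmanager
    def span(self, name: str):
        self.log.append((self._depth, name))
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1  # restore depth even if the body raises

tracer = NestingTracer()
with tracer.span("refinement_loop"):    # outer span: the whole loop
    with tracer.span("generate"):       # child spans: one per agent call
        pass
    with tracer.span("critique"):
        pass

# log: [(0, "refinement_loop"), (1, "generate"), (1, "critique")]
```

The real OTel SDK gives you the same shape for free via `tracer.start_as_current_span(...)`, which maintains parent-child links through context propagation.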
Sidebar: OTel Collector Architecture

The OpenTelemetry collector is a vendor-neutral proxy between your application and your observability backend: Application → OTLP exporter → Collector → Backend. The collector handles batching, retry, and fan-out. Your app sends to `localhost:4317`; the collector handles everything else.

Ship the in-memory `Tracer` for development, replace with the OTel SDK for production:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

The canonical `gen_ai.*` attributes set in this chapter will be parsed correctly by any OTel-compatible backend.
9. Break It#
Run your Chapter 04 refinement loop and measure it with this chapter's harness. Two generators, one critic. Which combo produces the best Pareto point?
state_a = await run_refinement_loop(task, model="claude-haiku-4-5-20251001")
state_b = await run_refinement_loop(task, model="claude-sonnet-4-6")
Run both final outputs through judge.pairwise(). Which is better? What does it cost? (See modules/07_eval_observability/what_goes_wrong.py for a demo where the judge scores an empty response 0.7 because the prompt told it to be lenient, a cautionary tale about rubric design.)
Sidebar: Anti-pattern: Eval Harness Theater
An eval harness whose results are never read is not an eval harness. It's theater.
Signs: - The harness runs in CI but no one looks unless a deploy fails - The same 3 failing cases have been in the harness for 6 weeks - The pass threshold was set at project start and has never been revisited - You added the harness to satisfy a compliance requirement, not to catch regressions
The fix: make the output actionable. Wire it into deploy as a hard gate. If `detect_regressions()` returns items, the deploy is blocked. If the gate is blocking too many valid deploys, fix the regressions; don't lower the threshold.
Chapter 06 forces this question at N=20 parallel workers: which model produces acceptable quality at the lowest cost? The harness is how you find out.
10. Exercises#
Exercise 01: Calibrate the Judge (exercises/01_calibrate_judge.py)#
Compare LLM judge scores to your own manual scores on 10 cases. Compute Pearson correlation (ranges -1 to +1). scipy.stats.pearsonr(human_scores, llm_scores)[0].
Expected: 0.7-0.9 for factual tasks, 0.5-0.7 for subjective. Below 0.5, the judge is measuring something different from what you care about. Reconsider the rubric.
Use the correlation to set your pass threshold: if the judge consistently scores 0.1 higher than you do, set threshold at 0.8.
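The calibration loop needs no scipy — Pearson correlation is a few lines of pure Python, and a pass/fail agreement check falls out of the same data. The sample scores below are invented for illustration:

```python
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient: covariance over product of std devs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def agreement(human: list[float], llm: list[float], threshold: float) -> float:
    """Fraction of cases where judge and human make the same pass/fail call."""
    same = sum((h >= threshold) == (l >= threshold) for h, l in zip(human, llm))
    return same / len(human)

human = [0.9, 0.4, 0.7, 0.8, 0.3, 0.6, 0.95, 0.5, 0.75, 0.2]   # your manual scores
llm   = [0.85, 0.5, 0.75, 0.8, 0.4, 0.7, 0.9, 0.55, 0.8, 0.3]  # judge scores

r = pearson(human, llm)            # high: the judge ranks cases like you do
acc = agreement(human, llm, 0.7)   # 9/10 pass/fail calls match at threshold 0.7
```

Sweep `threshold` over a grid and pick the value that maximizes `agreement`; that is the calibration Exercise 01 asks for.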
Exercise 02: Detect Position Bias (exercises/02_position_bias.py)#
Run judge.pairwise() 20 times on the same A/B pair. Track how often the winner flips when A↔B are swapped.
Expected: 15-25% flip rate on close pairs, 0-5% on clearly different-quality pairs. Above 30% means the judge is not reliable for that comparison. Use a stronger judge model.
Exercise 03: Regression Detector (exercises/03_regression_detector.py)#
Given two EvalRuns, flag cases where score dropped more than threshold. Wire into your deployment: if detect_regressions(baseline, candidate) returns items, block. The threshold=0.1 default catches meaningful regressions (0.9 → 0.79) while ignoring judge variance.
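One way the exercise's detector might look — a sketch that takes plain `{case_id: score}` dicts rather than full `EvalRun` objects, with the gate wiring left as a comment since it depends on your CI:

```python
def detect_regressions(baseline: dict[str, float],
                       candidate: dict[str, float],
                       threshold: float = 0.1) -> list[dict]:
    """Flag cases whose score dropped by more than threshold between runs."""
    return [
        {"case_id": cid,
         "baseline": baseline[cid],
         "candidate": candidate[cid],
         "delta": round(candidate[cid] - baseline[cid], 3)}
        for cid in baseline
        if cid in candidate and baseline[cid] - candidate[cid] > threshold
    ]

baseline  = {"geo-01": 0.90, "math-01": 0.82, "code-01": 0.85}
candidate = {"geo-01": 0.79, "math-01": 0.84, "code-01": 0.80}

regressions = detect_regressions(baseline, candidate)
# geo-01 dropped 0.90 → 0.79 (delta -0.11): flagged. code-01's -0.05 drop
# is within judge variance and ignored. In CI: if regressions, fail the job.
blocked = bool(regressions)
```

Note the asymmetry: improvements never offset regressions. A candidate that gains 0.2 on one case and loses 0.11 on another is still blocked, which is the point.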
11. Summary#
Key takeaways

- LLM judges have position bias (prefer first-presented ~60%) and verbosity bias. Mitigate with double evaluation and reference-backed scoring.
- The pass threshold (`score >= 0.7`) is a starting point. Calibrate with Exercise 01 by measuring Pearson correlation between judge and manual scores.
- Run all eval cases concurrently with `asyncio.gather`. For 20 cases this is the difference between a 2-second and 20-second eval.
- The `regressions` list from `harness.compare()` is more valuable than the headline score delta.
- OpenTelemetry GenAI attributes (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.*`) give you vendor-portable instrumentation. Switch backends without changing instrumentation code.
- Observability is the engineering implementation of Transparency. Add spans before you ship, not after something breaks.
- The Pareto frontier reveals which model delivers acceptable quality at the lowest cost. Real production frontiers have ≥3 axes (cost, accuracy, latency).
- Complexity must be earned. Before adding the harness, verify it will inform a concrete decision.
1. Zheng et al. (2023) demonstrated this with GPT-4 preferring GPT-4 outputs and Claude-family models preferring Claude outputs. As of 2026, the bias persists across all major frontier families. ↩