Chapter 05: Evaluation & Observability#
In this chapter

- Why "APPROVE" is a soft signal and what rigorous measurement looks like
- How LLM-as-judge works, what biases it has, and how to mitigate them
- How to build an eval harness that runs on every commit, in parallel, in seconds
- What OpenTelemetry's GenAI attributes give you, and why they matter for vendor portability
- How to plot a Pareto frontier across models and pick the right one for your quality target
- Why observability is not just operational hygiene but an implementation of the Transparency principle
1. Motivation#
Your Chapter 04 critic said "APPROVE" after round 2. But was the code actually better? You don't know. You're trusting the model to evaluate itself. It may approve bad code because it looks plausible, give a 7/10 to one version and an 8/10 to an identical version in a different conversation, or bias toward closure after N rounds. The "APPROVE" is a soft signal, not a measurement.
This chapter builds the measurement layer: LLM-as-judge with bias awareness, an eval harness you can run every commit, and OpenTelemetry traces with GenAI semantic attributes (the gen_ai.* convention). After this chapter, "better" has a number.
2. First Principles#
What makes a good eval?#
Three properties:
- Validity: it measures what actually matters. An eval that checks whether output contains certain keywords may be measuring verbosity, not correctness.
- Reliability: the same input produces the same verdict (or close). An eval with high variance is noise. You can't detect a 2% quality improvement through 20% evaluation noise.
- Cost: cheap enough to run on every commit, model swap, prompt change. An eval you run once a week is too slow to be useful.
Human evaluators score highly on validity and reliability, but fail on cost. LLM judges score well on cost but have known biases. The practical answer: LLM judge for continuous evaluation, human for final validation.
Position bias#
Zheng et al. (2023, arXiv:2306.05685) quantified a systematic problem: GPT-4 prefers whichever answer appears first roughly 60% of the time. Swap the two answers and re-evaluate, and the preference often follows the position rather than the content.
This is position bias: the model has a prior for "the first answer is better" regardless of quality. The bias is consistent across models and task categories. An evaluator with position bias is measuring presentation order, not quality.
The mitigation: evaluate in both orderings (A vs B, then B vs A) and average the scores. If the winner flips, flag it: position_bias_detected=True. This doubles the cost of pairwise evaluation but converts an unreliable signal into a reliable one.
Verbosity bias#
LLMs also tend to prefer longer answers. A 500-word response to a yes/no question may score higher than a concise "yes" even when the concise answer is correct.
The mitigation is reference-backed scoring: compare the output to a gold standard rather than evaluating in isolation. A one-word "Paris" scores 1.0 against a reference of "Paris". A 200-word explanation scores lower on conciseness rubrics.
Pairwise vs scalar#
- Scalar scoring: give the output 0.0 to 1.0. Simple. Aggregatable. Comparable across runs.
- Pairwise preference: ask "which is better, A or B?" More reliable but can't be directly aggregated.
We use both. judge.score() for continuous eval runs (need numbers). judge.pairwise() for model comparison (need relative ranking).
Callout: The 0.7 threshold and domain calibration
The pass threshold is `score >= 0.7`, a reasonable starting point, not a universal law. It is domain-dependent:

- Code: 0.8 or higher. A false pass risks deploying broken software.
- Factual Q&A: 0.75 or higher. Factual errors compound when users act on them.
- Creative writing: 0.5 or lower. Quality is subjective.
- Summarization: 0.65 typical. Some paraphrase is acceptable.
Calibrate with Exercise 01: compare judge scores to your own manual scores on 10 cases. Compute Pearson correlation. Adjust the threshold until the judge's pass/fail decisions match yours 80%+ of the time.
3. The Intellectual Lineage#
MT-Bench and Chatbot Arena (Zheng et al., 2023)#
"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., 2023, arXiv:2306.05685) makes three contributions:
MT-Bench: a multi-turn benchmark with 80 questions across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities). LLM judges correlate with human judgment (Spearman 0.85) when biases are controlled.
Position bias quantification: evaluators prefer first-presented answers ~60% of the time regardless of quality. This motivates the double-evaluation strategy in judge.pairwise(). Without double evaluation, roughly 10% of your pairwise comparisons are decided by presentation order rather than quality.
Self-enhancement bias: LLMs rate outputs matching their own style more highly. Cross-provider judges show the same pattern.1 Mitigation: use a different model as judge than the models being evaluated, or use multiple judges and average.
The Elo Rating System for LLMs (Chatbot Arena)#
Chatbot Arena (Chiang et al., 2024, arXiv:2403.04132) extended pairwise evaluation to human preference at scale: users compare two anonymized model outputs and vote. The resulting Elo ratings, the same system used in chess, are the most widely trusted benchmark for conversational AI quality. Aggregated pairwise human preferences converge faster than rubric-based scoring because "which is better?" is an easier judgment than "rate this from 0 to 10."
For your eval harness: include pairwise alongside scalar. Pairwise is more reliable for closely matched models; scalar is easier to aggregate over time.
SWE-bench (Jimenez et al., 2023)#
"SWE-bench: Can Language Models Resolve Real-world GitHub Issues?" (Jimenez et al., 2023, arXiv:2310.06770) is the gold standard for code evaluation. It uses real GitHub issues with real test suites as the pass/fail oracle. No LLM judge. Tests pass or fail.
Whenever you can replace an LLM judge with a verifiable oracle (test suite, type checker, linter, schema validator), do so. Use LLM judges only for judgment calls that can't be automated.
LLM-as-Judge architecture#
graph TD
subgraph "Inputs"
OUT["Candidate Output"]
REF["Reference (optional)"]
RUB["Rubric (optional)"]
end
subgraph "LLMJudge.score()"
PROMPT["Build judge prompt"]
LLM["LLM call (different model)"]
PARSE["Parse score + reasoning"]
end
subgraph "Output"
RESULT["EvalResult\nscore, reasoning, pass"]
end
OUT --> PROMPT
REF -.->|optional| PROMPT
RUB -.->|optional| PROMPT
PROMPT --> LLM --> PARSE --> RESULT
Use a different model as judge than the models being evaluated. Cross-provider judging reduces self-enhancement bias.
4. Build It#
Open code/eval.py.
The EvalCase and EvalResult dataclasses#
@dataclass
class EvalCase:
    id: str
    input: str
    expected_output: str | None = None
    tags: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
`expected_output` is optional: for reference-backed scoring you have it; for rubric-based scoring you don't. `tags` let you slice: `[r for r in run.results if "math" in case_by_id[r.case_id].tags]`. Invest in tags from day one so you can slice by category as your suite grows.
The judge system prompt:
You are an expert evaluator. Score the AI response on a 1-10 scale.
Criteria:
- Correctness: Does the response accurately address the task?
- Completeness: Are all aspects covered?
- Clarity: Is the response well-structured?
Return ONLY: {"score": <int 1-10>, "rationale": "<one sentence>"}
LLMJudge.score()#
async def score(self, output, *, reference=None, rubric=""):
    if reference:
        prompt = f"Reference:\n{reference}\n\nCandidate:\n{output}\n\n..."
    else:
        prompt = f"Output:\n{output}\n\n{rubric}\n\n..."
    result = await self._call(prompt, system=_JUDGE_SYSTEM, max_tokens=100)
    return self._parse_score_response(result.text)
[full: swarm/eval/harness.py:80-140]
Always return reasoning. When a case fails, a bare score tells you nothing about whether the failure was a hallucination, format error, or factual mistake.
The max_tokens=100 cap is intentional. Judge outputs should be a score and one sentence. If you let the judge reason at length, you're paying for text the harness never reads.
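The parsing step deserves defensive care, because judges sometimes wrap the JSON in prose. One plausible implementation — the real one lives in swarm/eval/harness.py; this sketch assumes the JSON contract from the system prompt above and a simple score/10 normalization:

```python
import json
import re

def parse_score_response(text: str) -> tuple[float, str]:
    """Extract {"score": <int 1-10>, "rationale": ...}; normalize to 0.0-1.0."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # tolerate prose around the JSON
    if match is None:
        return 0.0, f"unparseable judge output: {text[:80]}"
    try:
        payload = json.loads(match.group(0))
        raw = int(payload["score"])
    except (ValueError, KeyError, TypeError):
        return 0.0, f"malformed judge output: {text[:80]}"
    raw = max(1, min(10, raw))  # clamp to the advertised 1-10 scale
    return raw / 10, str(payload.get("rationale", ""))

score, rationale = parse_score_response(
    'Sure! {"score": 8, "rationale": "accurate and concise"}')
# score == 0.8
```

Failing closed (score 0.0 with a diagnostic rationale) is the safer default: a malformed judge response should never count as a pass.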
LLMJudge.pairwise(): position bias mitigation#
async def pairwise(self, output_a, output_b, *, prompt) -> dict:
    # Forward: A first
    score_a_fwd, score_b_fwd = await self._compare_once(
        output_a, output_b, prompt=prompt)
    # Reversed: B first
    score_b_rev, score_a_rev = await self._compare_once(
        output_b, output_a, prompt=prompt)
    avg_a = (score_a_fwd + score_a_rev) / 2
    avg_b = (score_b_fwd + score_b_rev) / 2
    winner_fwd = "a" if score_a_fwd > score_b_fwd else "b"
    winner_rev = "a" if score_a_rev > score_b_rev else "b"
    position_bias_detected = winner_fwd != winner_rev
Two evaluations, one each ordering. If the winner flips, the judge is position-biased on this pair. Running both and averaging cancels the directional bias. position_bias_detected=True surfaces cases where the candidates are genuinely close and the judge is uncertain.
sequenceDiagram
participant Caller
participant Judge
Caller->>Judge: pairwise(A, B)
Note over Judge: Forward pass: A first
Judge-->>Caller: score_a_fwd=0.7, score_b_fwd=0.3
Caller->>Judge: pairwise(B, A)
Note over Judge: Reversed pass: B first
Judge-->>Caller: score_b_rev=0.6, score_a_rev=0.4
Note over Caller: avg_a=0.55, avg_b=0.45
Note over Caller: winner_fwd=A, winner_rev=B → position_bias_detected
Caller-->>Caller: flag for human review
EvalHarness.run() and .compare()#
async def run(self, model, system) -> EvalRun:
    results = list(await asyncio.gather(*[_run_one(c) for c in self.cases]))
All cases run concurrently. At 50 cases, this is the difference between a 5-second eval and a 50-second eval.
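One caveat the snippet glosses over: at larger case counts, an unbounded gather can trip provider rate limits. A bounded variant is a small change — sketched here with a semaphore (`run_bounded`, the stub runner, and the cap of 10 are illustrative, not part of the chapter's harness):

```python
import asyncio

async def run_bounded(cases, run_one, max_concurrency: int = 10):
    """Run eval cases concurrently, but never more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(case):
        async with sem:
            return await run_one(case)

    # gather preserves input order, so results line up with cases.
    return await asyncio.gather(*[guarded(c) for c in cases])

# Demo with a stub "case runner" standing in for a real LLM call:
async def fake_run_one(case_id):
    await asyncio.sleep(0.01)
    return {"case_id": case_id, "score": 0.8}

results = asyncio.run(run_bounded(["geo-01", "math-01", "code-01"], fake_run_one))
```

With the cap at or above the case count this behaves identically to plain `gather`; lower it only when the provider starts returning 429s.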
The .compare() method returns a regressions list: cases that passed before and now fail, or whose scores dropped by more than 0.1. The > 0.1 threshold is calibrated to judge variance (~0.05-0.08 on repeated evaluations of identical inputs). If you use a weaker judge with higher variance, raise it to 0.15.
The regressions list is where you catch silent failures. A model swap that improves average score but regresses on 3 specific cases may be unacceptable.
def compare(self, run_a, run_b) -> dict:
    regressions = [
        {"case_id": cid, "score_a": scores_a[cid], "score_b": scores_b[cid]}
        for cid in scores_a
        if cid in scores_b and scores_a[cid] - scores_b[cid] > 0.1
    ]
Observability: Measuring the Process#
The eval harness measures output quality. Observability measures process health: how long did it take, how many tokens, where did it fail? Both are required for production.
The Span and Tracer classes#
@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)

    def set(self, key, value) -> None: ...
    def finish(self) -> None: ...
[full: swarm/observability/tracer.py:20-80]
Self-contained, no OTel SDK required. Attribute names follow the OpenTelemetry Semantic Conventions for Generative AI:
- `gen_ai.system`: "anthropic" | "openai" | "litellm"
- `gen_ai.request.model`: the model ID
- `gen_ai.operation.name`: the agent role
- `gen_ai.usage.input_tokens`: input token count
- `gen_ai.usage.output_tokens`: output token count
- `gen_ai.usage.cost_usd`: USD cost (extension, not in the official spec yet)
Using canonical attribute names means you can swap in the real OTel SDK later and your traces will parse correctly in Grafana, Jaeger, Honeycomb, without changing instrumentation code.
Sidebar: Why OTel Semantic Conventions Matter

OpenTelemetry is a CNCF standard for distributed tracing. The "semantic conventions" are the agreed-upon attribute names for common operations. The GenAI conventions (`gen_ai.*`) were finalized in 2024 and are now supported by every major observability platform.

The practical benefit is vendor portability. Instrument your agent with `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and you can switch from Honeycomb to Datadog to Grafana without changing a line of instrumentation code. The collector changes; the spans stay the same.

The `gen_ai.usage.cost_usd` attribute in this chapter's code is an extension. The spec avoids encoding pricing (it's provider-specific). Add it as a custom attribute.
5. Transparency as Engineering#
Observability is the engineering implementation of the Transparency principle: agents should surface their planning and reasoning rather than acting as black boxes. Every LLM call produces a span with the attributes above. Surface this record: to operators via a dashboard, to users via a "thinking" panel, to yourself via the span log.
graph TB
subgraph "Agent System"
G[Generator] -->|span| T[(Tracer)]
C[Critic] -->|span| T
R[Refiner] -->|span| T
end
subgraph "Observability Stack"
T -->|OTLP| COL[OTel Collector]
COL --> GRAF[Grafana / Jaeger]
COL --> HON[Honeycomb / Datadog]
end
subgraph "User-Facing"
T -->|real-time| PANEL[Thinking Panel]
T -->|post-run| AUDIT[Audit Log]
end
Observability is a feature, not a debugging tool. Add spans before you ship. Users and operators need visibility to trust the system, not just developers trying to diagnose failures.
6. Run It#
Output (mock mode):
── EvalHarness ──────────────────────────────────────
Run ID: a3f2c8d1
Cases: 5
Passed: 4/5
Avg score: 0.7820
Case Score Pass Output
geo-01 0.8200 PASS Mock answer for: What is the cap
math-01 0.8200 PASS Mock answer for: What is 12 × 12
code-01 0.8200 PASS Mock answer for: Write a Python
In real mode, some cases will fail, especially code-01 where the model may return an explanation instead of the bare expression, dropping the score below 0.7.
7. Observe It#
The eval pipeline#
The eval harness is not a one-off script. It's a pipeline that runs on every meaningful change:
flowchart LR
CHANGE[Code / prompt / model change] --> RUN[harness.run]
RUN --> COMPARE[baseline vs candidate]
COMPARE --> REG{Regressions?}
REG -->|Yes| BLOCK[Block deploy]
REG -->|No| PARETO[Plot Pareto]
PARETO --> DECIDE[Pick model on frontier]
Running is cheap (<10s for 20 cases). A regression reaching production is not. Wire it into CI.
The Pareto frontier#
xychart-beta
title "Model Pareto Frontier: Quality vs Cost"
x-axis ["Haiku ($0.001)", "GPT-4o-mini ($0.002)", "Sonnet ($0.005)", "GPT-4o ($0.01)", "Opus ($0.02)"]
y-axis "Avg Eval Score" 0.70 --> 1.00
line [0.78, 0.82, 0.91, 0.89, 0.96]
harness.pareto_point(run) returns (avg_score, total_cost), one point on this chart. Run the same eval with Haiku, Sonnet, Opus. The efficient frontier is the curve where no point is dominated (higher score at lower or equal cost). Sonnet typically sits on it: meaningfully better than Haiku, much cheaper than Opus.
Don't default to the best model. Default to the cheapest model on the efficient frontier for your quality target.
Plotting takes only a few lines:

import matplotlib.pyplot as plt

runs = {"haiku": haiku_run, "sonnet": sonnet_run, "opus": opus_run}
for name, run in runs.items():
    score, cost = harness.pareto_point(run)
    plt.scatter(cost, score, label=name)
plt.xlabel("Total cost (USD)"); plt.ylabel("Avg score")
plt.legend(); plt.title("Pareto Frontier"); plt.show()
If Sonnet is northeast of Haiku and southwest of Opus, it's on the efficient frontier.
Callout: Real Pareto frontiers have ≥3 axes
The cost × accuracy plot is a simplification. Real production frontiers have at least three axes: - Cost: total USD per task - Accuracy: eval score - Latency: p95 response time in milliseconds
For some deployments, add more: compliance risk, privacy tier, data residency.
Multi-objective optimization across three or more axes has no single "Pareto optimal" point, only a surface. Practical approaches: Pareto ranking (rank by how many others dominate you); constraint satisfaction (fix latency and cost constraints, maximize accuracy within them); weighted sum (assign weights, compute a scalar; sensitive to weights).
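The dominance check behind Pareto ranking is a few lines of code. This sketch assumes lower cost and latency are better and higher score is better; the sample numbers reuse the scores from the chart above with invented latencies:

```python
def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    at_least = (a["cost"] <= b["cost"] and a["latency"] <= b["latency"]
                and a["score"] >= b["score"])
    strictly = (a["cost"] < b["cost"] or a["latency"] < b["latency"]
                or a["score"] > b["score"])
    return at_least and strictly

def pareto_frontier(points: list[dict]) -> list[dict]:
    """Keep only the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

models = [
    {"name": "haiku",  "cost": 0.001, "latency": 400,  "score": 0.78},
    {"name": "sonnet", "cost": 0.005, "latency": 900,  "score": 0.91},
    {"name": "gpt-4o", "cost": 0.010, "latency": 1100, "score": 0.89},
    {"name": "opus",   "cost": 0.020, "latency": 1600, "score": 0.96},
]
frontier = [p["name"] for p in pareto_frontier(models)]
# gpt-4o is dominated by sonnet (cheaper, faster, higher score); the rest survive.
```

This is the constraint-satisfaction approach's first step too: filter to the frontier, then apply your latency and cost caps to what remains.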
8. The Observability Stack#
Each LLM call is one span:
span = tracer.start_span("generate")
span.set("gen_ai.system", "anthropic")
span.set("gen_ai.request.model", model)
span.set("gen_ai.operation.name", "generate")
span.set("gen_ai.usage.input_tokens", result.usage.input_tokens)
span.set("gen_ai.usage.output_tokens", result.usage.output_tokens)
span.finish()
In production, spans nest: the outer span covers the entire refinement loop; child spans cover each generator, critic, and refiner call. The nesting makes latency analysis possible without parsing log timestamps.
Instrument at two levels:

- Every LLM call: `gen_ai.*` attributes, latency, error.
- Every logical operation: generator, critic, refiner, eval case, handoff.
Level 1 gives you cost and token tracking. Level 2 gives operational insight: which agents are slowest, which tasks cause critics to take multiple rounds, which eval cases consistently fail. Don't instrument finer unless a specific question requires it. Over-instrumentation floods the backend.
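Nesting is easiest to get right with a context manager. The sketch below is an assumption-laden toy — it records only (depth, name) pairs rather than full spans, and the `span()` API differs from the chapter's `start_span`/`finish` — but it shows how nesting survives into the trace without timestamp parsing:

```python
from contextlib import contextmanager

class NestingTracer:
    """Minimal tracer that records (depth, name) so nesting survives into the log."""
    def __init__(self):
        self.log: list[tuple[int, str]] = []
        self._depth = 0

    @contextmanager
    def span(self, name: str):
        self.log.append((self._depth, name))
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1  # restore depth even if the body raises

tracer = NestingTracer()
with tracer.span("refinement_loop"):    # outer span: the whole loop
    with tracer.span("generate"):       # child spans: one per agent call
        pass
    with tracer.span("critique"):
        pass

# log: [(0, "refinement_loop"), (1, "generate"), (1, "critique")]
```

The real OTel SDK gives you the same shape for free via `tracer.start_as_current_span(...)`, which maintains parent-child links through context propagation.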
Sidebar: OTel Collector Architecture

The OpenTelemetry collector is a vendor-neutral proxy between your application and your observability backend: Application → OTLP exporter → Collector → Backend. The collector handles batching, retry, and fan-out. Your app sends to `localhost:4317`; the collector handles everything else.

Ship the in-memory `Tracer` for development, replace with the OTel SDK for production:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

The canonical `gen_ai.*` attributes set in this chapter will be parsed correctly by any OTel-compatible backend.
9. Break It#
Run your Chapter 04 refinement loop and measure it with this chapter's harness. Two generators, one critic. Which combo produces the best Pareto point?
state_a = await run_refinement_loop(task, model="claude-haiku-4-5-20251001")
state_b = await run_refinement_loop(task, model="claude-sonnet-4-6")
Run both final outputs through judge.pairwise(). Which is better? What does it cost? (See modules/07_eval_observability/what_goes_wrong.py for a demo where the judge scores an empty response 0.7 because the prompt told it to be lenient, a cautionary tale about rubric design.)
Sidebar: Anti-pattern: Eval Harness Theater
An eval harness whose results are never read is not an eval harness. It's theater.
Signs: - The harness runs in CI but no one looks unless a deploy fails - The same 3 failing cases have been in the harness for 6 weeks - The pass threshold was set at project start and has never been revisited - You added the harness to satisfy a compliance requirement, not to catch regressions
The fix: make the output actionable. Wire it into deploy as a hard gate. If `detect_regressions()` returns items, the deploy is blocked. If the gate is blocking too many valid deploys, fix the regressions; don't lower the threshold.
Chapter 06 forces this question at N=20 parallel workers: which model produces acceptable quality at the lowest cost? The harness is how you find out.
10. Exercises#
Exercise 01: Calibrate the Judge (exercises/01_calibrate_judge.py)#
Compare LLM judge scores to your own manual scores on 10 cases. Compute Pearson correlation (ranges -1 to +1). scipy.stats.pearsonr(human_scores, llm_scores)[0].
Expected: 0.7-0.9 for factual tasks, 0.5-0.7 for subjective. Below 0.5, the judge is measuring something different from what you care about. Reconsider the rubric.
Use the correlation to set your pass threshold: if the judge consistently scores 0.1 higher than you do, set threshold at 0.8.
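The calibration loop needs no scipy — Pearson correlation is a few lines of pure Python, and a pass/fail agreement check falls out of the same data. The sample scores below are invented for illustration:

```python
from statistics import mean

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient: covariance over product of std devs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def agreement(human: list[float], llm: list[float], threshold: float) -> float:
    """Fraction of cases where judge and human make the same pass/fail call."""
    same = sum((h >= threshold) == (l >= threshold) for h, l in zip(human, llm))
    return same / len(human)

human = [0.9, 0.4, 0.7, 0.8, 0.3, 0.6, 0.95, 0.5, 0.75, 0.2]   # your manual scores
llm   = [0.85, 0.5, 0.75, 0.8, 0.4, 0.7, 0.9, 0.55, 0.8, 0.3]  # judge scores

r = pearson(human, llm)            # high: the judge ranks cases like you do
acc = agreement(human, llm, 0.7)   # 9/10 pass/fail calls match at threshold 0.7
```

Sweep `threshold` over a grid and pick the value that maximizes `agreement`; that is the calibration Exercise 01 asks for.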
Exercise 02: Detect Position Bias (exercises/02_position_bias.py)#
Run judge.pairwise() 20 times on the same A/B pair. Track how often the winner flips when A↔B are swapped.
Expected: 15-25% flip rate on close pairs, 0-5% on clearly different-quality pairs. Above 30% means the judge is not reliable for that comparison. Use a stronger judge model.
Exercise 03: Regression Detector (exercises/03_regression_detector.py)#
Given two EvalRuns, flag cases where score dropped more than threshold. Wire into your deployment: if detect_regressions(baseline, candidate) returns items, block. The threshold=0.1 default catches meaningful regressions (0.9 → 0.79) while ignoring judge variance.
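One way the exercise's detector might look — a sketch that takes plain `{case_id: score}` dicts rather than full `EvalRun` objects, with the gate wiring left as a comment since it depends on your CI:

```python
def detect_regressions(baseline: dict[str, float],
                       candidate: dict[str, float],
                       threshold: float = 0.1) -> list[dict]:
    """Flag cases whose score dropped by more than threshold between runs."""
    return [
        {"case_id": cid,
         "baseline": baseline[cid],
         "candidate": candidate[cid],
         "delta": round(candidate[cid] - baseline[cid], 3)}
        for cid in baseline
        if cid in candidate and baseline[cid] - candidate[cid] > threshold
    ]

baseline  = {"geo-01": 0.90, "math-01": 0.82, "code-01": 0.85}
candidate = {"geo-01": 0.79, "math-01": 0.84, "code-01": 0.80}

regressions = detect_regressions(baseline, candidate)
# geo-01 dropped 0.90 → 0.79 (delta -0.11): flagged. code-01's -0.05 drop
# is within judge variance and ignored. In CI: if regressions, fail the job.
blocked = bool(regressions)
```

Note the asymmetry: improvements never offset regressions. A candidate that gains 0.2 on one case and loses 0.11 on another is still blocked, which is the point.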
11. Summary#
Key takeaways

- LLM judges have position bias (prefer first-presented ~60%) and verbosity bias. Mitigate with double evaluation and reference-backed scoring.
- The pass threshold (`score >= 0.7`) is a starting point. Calibrate with Exercise 01 by measuring Pearson correlation between judge and manual scores.
- Run all eval cases concurrently with `asyncio.gather`. For 20 cases this is the difference between a 2-second and 20-second eval.
- The `regressions` list from `harness.compare()` is more valuable than the headline score delta.
- OpenTelemetry GenAI attributes (`gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.*`) give you vendor-portable instrumentation. Switch backends without changing instrumentation code.
- Observability is the engineering implementation of Transparency. Add spans before you ship, not after something breaks.
- The Pareto frontier reveals which model delivers acceptable quality at the lowest cost. Real production frontiers have ≥3 axes (cost, accuracy, latency).
- Complexity must be earned. Before adding the harness, verify it will inform a concrete decision.
1. Zheng et al. (2023) demonstrated this with GPT-4 preferring GPT-4 outputs and Claude-family models preferring Claude outputs. As of 2026, the bias persists across all major frontier families. ↩