Appendix B: Benchmarks Methodology#

This appendix covers the major benchmarks used to evaluate agentic systems as of April 2026. For each benchmark we describe what it measures, how it is scored, the April 2026 state-of-the-art (SOTA) numbers, and where to find it.

These benchmarks are used in Module 12 (Capstone) to evaluate the swarm you have built. Understand the methodology before you run anything: benchmark scores are easy to misread without it.


SWE-bench Verified#

What it measures: The ability of an agent to resolve real GitHub issues from open-source Python repositories. Each task is a GitHub issue plus the repository state at the time of filing. The agent must produce a patch that passes the repository's existing test suite.

How scored: Percentage of issues resolved — defined as the agent's patch passing all pre-existing tests plus any new tests added by the issue's ground-truth solution. "Verified" refers to the human-verified subset (500 problems) in which annotators manually confirmed that each issue is solvable and its tests are a fair check of the fix.

April 2026 SOTA: ~72% (Claude-based systems on the verified set). The full SWE-bench (2,294 issues) is harder; SOTA is ~55%.

Link: https://www.swebench.com
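The resolution criterion above can be sketched as a small check. A minimal sketch, assuming per-test boolean results: the real harness applies the patch and runs the repository's test suite in a container; the function names here are illustrative, not the official SWE-bench code.

```python
# Sketch of the SWE-bench resolution criterion. "fail_to_pass" are the tests
# the ground-truth fix makes pass; "pass_to_pass" are the pre-existing tests
# that must not regress. Names are illustrative, not the official harness.

def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """A patch resolves an issue only if ALL tests in both groups pass."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

def resolution_rate(results: list[tuple[dict, dict]]) -> float:
    """Fraction of issues resolved -- the headline SWE-bench number."""
    resolved = sum(is_resolved(f2p, p2p) for f2p, p2p in results)
    return resolved / len(results)

if __name__ == "__main__":
    results = [
        ({"test_fix": True},  {"test_old": True}),   # resolved
        ({"test_fix": True},  {"test_old": False}),  # regression: not resolved
        ({"test_fix": False}, {"test_old": True}),   # fix fails: not resolved
    ]
    print(f"{resolution_rate(results):.2%}")  # 33.33%
```

Note that a patch which fixes the issue but breaks one unrelated pre-existing test scores zero; there is no partial credit per issue.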


SWE-bench Lite#

What it measures: The same issue-resolution task as SWE-bench, restricted to a curated subset of 300 problems selected for difficulty balance and reproducibility. Used in this course (Module 12) because it is cheaper to run than the full benchmark.

How scored: Same as SWE-bench Verified — percentage of patches passing all tests.

April 2026 SOTA: ~68% on Lite. The Lite subset tends to be slightly easier than the full set.

Link: https://www.swebench.com/lite

Course note: benchmarks/swe_bench_lite/runner.py is the harness we use in Module 12.


GAIA (General AI Assistants)#

What it measures: Real-world question answering requiring multi-step reasoning, tool use, and web research. GAIA questions are designed to be conceptually simple for humans but hard for current LLMs. Three levels of difficulty:

  • Level 1: Single-step, minimal tool use (e.g., "What is the capital of the country whose flag has the most stars?")
  • Level 2: Multi-step, moderate tool use and reasoning chains
  • Level 3: Complex, requires sustained multi-step planning and diverse tool use

How scored: Exact match on the final answer. No partial credit. Binary pass/fail per question.

April 2026 SOTA: L1: ~92%, L2: ~75%, L3: ~55%. Numbers vary significantly by system design.

Link: https://huggingface.co/spaces/gaia-benchmark/leaderboard
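The exact-match scoring above can be sketched as follows. This is a simplified sketch, not the official GAIA scorer: the `normalize` rules here (case, whitespace, thousands separators) are an assumption about the general shape of answer normalization, and the official scorer has its own rules.

```python
def normalize(ans: str) -> str:
    """Light answer normalization: lowercase, trim, drop thousands separators.
    Simplified assumption; the official GAIA scorer defines its own rules."""
    return ans.strip().lower().replace(",", "")

def gaia_score(predictions: list[str], gold: list[str]) -> float:
    """Binary pass/fail per question, no partial credit."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return hits / len(gold)

if __name__ == "__main__":
    preds = ["Brasília", "1,024", "42"]
    gold  = ["Brasília", "1024", "41"]
    print(f"{gaia_score(preds, gold):.2%}")  # 66.67%
```

Because scoring is all-or-nothing on the final answer, an agent that reasons correctly but formats the answer wrong scores zero, which is one reason GAIA numbers vary so much by system design.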


WebArena#

What it measures: Completing tasks in realistic simulated web environments (shopping sites, Reddit, GitLab, etc.). The agent must navigate, fill forms, click, and reason about web state.

How scored: Task completion rate — whether the specified end state is achieved. Tasks are functional (e.g., "Buy the cheapest item in this category and leave a review").

April 2026 SOTA: ~45–50% depending on system. WebArena remains one of the harder benchmarks because it requires sustained multi-step web interaction.

Link: https://webarena.dev
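"Functional" scoring means the checker inspects the final environment state, not the agent's transcript. A minimal sketch, approximating the end-state check as URL and page-content assertions; WebArena's real evaluators are richer (URL matching, DOM queries, program-based checks), so treat every name below as illustrative.

```python
from dataclasses import dataclass

@dataclass
class FinalState:
    """The environment state after the agent stops (illustrative)."""
    url: str
    page_text: str

def task_success(state: FinalState, required_url_part: str,
                 required_texts: list[str]) -> bool:
    """Pass only if the specified end state is reached -- approximated here
    as URL + page-content checks. Not WebArena's actual evaluator."""
    return (required_url_part in state.url
            and all(t in state.page_text for t in required_texts))

def completion_rate(outcomes: list[bool]) -> float:
    """Task completion rate over a batch -- the headline WebArena number."""
    return sum(outcomes) / len(outcomes)

if __name__ == "__main__":
    state = FinalState(url="https://shop.example/orders/123",
                       page_text="Review submitted. Thank you!")
    print(task_success(state, "/orders/", ["Review submitted"]))  # True
```

The key point for interpreting scores: intermediate progress (items in the cart, a half-filled form) earns nothing unless the final state matches.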


HumanEval+#

What it measures: Code generation correctness. An extended version of OpenAI's HumanEval, with more test cases per problem to reduce false positives (incorrect solutions that happen to pass the original, sparser tests).

How scored: Pass@1 — percentage of problems where the first generated solution passes all test cases.

April 2026 SOTA: ~90%+ for frontier models. HumanEval+ is now close to saturated; SWE-bench is the more meaningful coding benchmark.

Link: https://github.com/evalplus/evalplus
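Pass@1 generalizes to the standard unbiased pass@k estimator introduced with HumanEval: generate n samples per problem, count c correct, and estimate the probability that at least one of k drawn samples passes as 1 - C(n-c, k)/C(n, k). This formula is standard; the code below is just a direct transcription of it.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c pass, is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    # 10 samples per problem, 3 of which pass:
    print(round(pass_at_k(10, 3, 1), 3))  # 0.3 (for k=1 this is just c/n)
    print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

For k=1 the estimator reduces to c/n, which is why pass@1 leaderboard numbers are usually averaged over many samples rather than a single greedy run.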


A Note on Benchmark Gaming#

All benchmarks can be gamed. SWE-bench scores can be inflated by training on the test set. GAIA scores can be inflated by hardcoding answers for known questions. When reading leaderboard numbers, check: (a) whether the system was trained on any data from the test split, (b) whether the evaluation is verified by a third party, and (c) whether the numbers are for the standard evaluation or a custom subset.

For this course, we run SWE-bench Lite and GAIA L1 with zero post-training on test data. The score you get is the honest score.