
Code Review Bot#

A PR-triage agent that reads a unified diff, runs deterministic static checks via tools, and produces a structured markdown review with a verdict of APPROVE / REQUEST_CHANGES / NEEDS_DISCUSSION.

Problem#

Every engineering team wastes hours on PR review work that is purely mechanical: hardcoded secrets, missing tests, giant diffs, SQL built by string concatenation. These are not interesting things for a human to catch — but they slip through often enough to cause incidents.

This bot triages those mechanical issues automatically and leaves humans to argue about the interesting parts (architecture, naming, trade-offs). It produces a review comment that:

  • Assigns one of three verdicts (APPROVE, REQUEST_CHANGES, NEEDS_DISCUSSION).
  • Flags risks by category (security, tests, complexity, size).
  • Emits inline comments sorted by severity, capped at a readable number.
  • Appends a short reviewer-tone paragraph written by the LLM.
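For concreteness, the output described above can be sketched as follows; the dataclass fields and the cap of ten inline comments are illustrative assumptions, not the bot's actual types:

```python
from dataclasses import dataclass, field

@dataclass
class RiskFlag:
    category: str   # "security" | "tests" | "complexity" | "size"
    severity: int   # higher = worse
    message: str

@dataclass
class Review:
    verdict: str                      # APPROVE / REQUEST_CHANGES / NEEDS_DISCUSSION
    risks: list = field(default_factory=list)
    summary: str = ""                 # LLM-written wrap-up paragraph

MAX_INLINE_COMMENTS = 10  # cap inline comments at a readable number

def format_markdown(review: Review) -> str:
    lines = [f"## Verdict: {review.verdict}", ""]
    # Sort by severity, worst first, then cap.
    shown = sorted(review.risks, key=lambda r: -r.severity)[:MAX_INLINE_COMMENTS]
    for r in shown:
        lines.append(f"- **{r.category}**: {r.message}")
    if review.summary:
        lines += ["", review.summary]
    return "\n".join(lines)
```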

Architecture#

  diff text
     |
     v
  parse_diff  ---------------+
     |                        |
     v                        |
  analyze_pr  (reviewer.py)   |  deterministic tools:
     |                        |   - check_security_patterns
     v                        |   - check_test_coverage
  classify    (reviewer.py)   |   - compute_complexity
     |                        |   - find_related_code
     v                        |   - look_up_style_guide
  LLM summary (agent.py)------+
     |
     v
  format_markdown -> GitHub comment

The deterministic layer does all the verdict-forming work. The LLM only writes the natural-language wrap-up paragraph. That split is deliberate:

  • Policy is auditable. You can read reviewer.py top-to-bottom and know why a PR got REQUEST_CHANGES. You do not have to trust a prompt.
  • Cost stays low. The model emits ~60 output tokens per review, not hundreds. Most of the work is regex and subprocess calls.
  • Tests are deterministic. The verdict tests don't mock an LLM. They run real Python against real diffs.
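As a sketch of what that auditable policy looks like — rule names, risk shape, and thresholds here are illustrative, not the shipped reviewer.py:

```python
def classify(risks: list[dict]) -> str:
    """Map deterministic risk flags to a verdict.

    Each risk is a dict like {"category": "security", "severity": 3}.
    The precedence below is an assumption for illustration.
    """
    categories = {r["category"] for r in risks}
    if "security" in categories:
        return "REQUEST_CHANGES"       # secrets/injection always block
    if "tests" in categories or "size" in categories:
        return "NEEDS_DISCUSSION"      # humans decide on missing tests or huge diffs
    return "APPROVE"
```

Because the function is ordinary Python over plain data, reading it top-to-bottom tells you exactly why any given PR got its verdict.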

Cost estimate#

At default settings (Claude Haiku for the summary, ~250 input / 60 output tokens per review):

  • Haiku 4.5: ~$0.0003 per PR.
  • Sonnet 4.6: ~$0.005 per PR.
  • Opus 4.7: ~$0.05 per PR (only if you want the nicest prose).

The sample diffs average 30-100 input lines. Real PRs average 200-600. You pay roughly in proportion to the diff size; the tools do not feed every line to the model.
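The arithmetic is just token counts times per-million-token prices; the prices in the example call are placeholders, so substitute the current published rates:

```python
def cost_per_review(in_tokens: int, out_tokens: int,
                    in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one review, given per-million-token prices."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000

# Placeholder prices ($1/M input, $5/M output) at the default token counts:
estimate = cost_per_review(250, 60, 1.0, 5.0)
```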

Tools#

| Tool | Purpose |
| --- | --- |
| parse_diff | Unified-diff parser; returns hunks with language tags. |
| check_security_patterns | Regex scan for hardcoded secrets, SQL/shell injection, weak hashes. |
| check_test_coverage | Did the diff touch source without touching tests? |
| find_related_code | Ripgrep over a repo path; useful for pulling in callers. |
| look_up_style_guide | Per-language rule list for anchoring style comments. |
| compute_complexity | Keyword-based cyclomatic-complexity estimate. |
Each tool returns a JSON string. The agent reads those strings as observations; the reviewer module interprets them into a verdict.
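To make that contract concrete, here is what a pared-down check_security_patterns could look like; the patterns and JSON field names are assumptions, and the real tool's rule set is broader:

```python
import json
import re

# Illustrative patterns only; the shipped tool covers more cases.
SECURITY_PATTERNS = [
    (r"(?i)(api[_-]?key|secret|password)\s*=\s*['\"][^'\"]+['\"]", "hardcoded secret"),
    (r"execute\([^)]*%s|execute\([^)]*\+", "possible SQL injection"),
    (r"\bmd5\b|\bsha1\b", "weak hash"),
]

def check_security_patterns(diff_text: str) -> str:
    """Scan added lines of a unified diff; return findings as a JSON string."""
    findings = []
    for line in diff_text.splitlines():
        if not line.startswith("+"):
            continue  # only scan additions
        for pattern, label in SECURITY_PATTERNS:
            if re.search(pattern, line):
                findings.append({"label": label, "line": line[1:].strip()})
    return json.dumps({"findings": findings})
```

The agent never parses these strings itself; it passes them through as tool observations, and the reviewer module decodes and interprets them.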

Running#

Mock mode is the default when there's no ANTHROPIC_API_KEY in the env.

SWARM_MOCK=true python -m projects.code_review_bot.agent \
    projects/code_review_bot/sample_diffs/security_issue.diff

You can also use it programmatically (review_pull_request is a coroutine, so wrap it with asyncio.run when calling from synchronous code):

import asyncio

from projects.code_review_bot.agent import review_pull_request

result = asyncio.run(review_pull_request(diff_text))  # diff_text: a unified-diff string
print(result.markdown)
print(result.report.verdict)       # "REQUEST_CHANGES"
print(result.report.risks)         # [RiskFlag(...)]

Integrating with GitHub (production)#

We do not ship a webhook listener — the bot is pipeline-agnostic. Wire it like this:

  1. Webhook receiver. A small FastAPI service that subscribes to GitHub's pull_request events (opened, synchronized, reopened). Extract the PR number and the head SHA.
  2. Diff fetch. GET /repos/{owner}/{repo}/pulls/{number} with header Accept: application/vnd.github.v3.diff returns the raw unified diff.
  3. Review. await review_pull_request(diff_text).
  4. Post. If result.report.verdict != "APPROVE", post result.markdown as a PR comment via POST /repos/{owner}/{repo}/issues/{number}/comments. For APPROVE, consider submitting a review with state=APPROVE via the pull-request review API instead.
  5. Cache by SHA. The bot is deterministic for a given diff — cache the result keyed by head SHA so re-runs don't charge you twice.

Rate limits: GitHub allows ~5,000 API calls per hour per installation. At one review per PR and a healthy team, you will not come close.
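Steps 2 and 4 reduce to two HTTP calls against GitHub's REST API. A stdlib-only sketch of the request construction — the endpoint paths and media type come from the steps above; the function names are illustrative:

```python
import json
import urllib.request

API = "https://api.github.com"

def build_diff_request(owner: str, repo: str, number: int, token: str):
    """Step 2: GET the raw unified diff (the Accept header selects the diff media type)."""
    return urllib.request.Request(
        f"{API}/repos/{owner}/{repo}/pulls/{number}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github.v3.diff",
        },
    )

def build_comment_request(owner: str, repo: str, number: int, token: str, markdown: str):
    """Step 4: POST the review markdown as an issue comment."""
    return urllib.request.Request(
        f"{API}/repos/{owner}/{repo}/issues/{number}/comments",
        data=json.dumps({"body": markdown}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending each request is then urllib.request.urlopen(req); step 3 sits in between as asyncio.run(review_pull_request(diff_text)), and the post only happens when the verdict is not APPROVE.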

Tests#

SWARM_MOCK=true .venv/bin/pytest projects/code_review_bot/ -v

The test suite covers:

  • Tool unit tests (each regex catches what it should, passes what it shouldn't).
  • Reviewer policy tests (clean → APPROVE, secrets → REQUEST_CHANGES, missing tests → NEEDS_DISCUSSION, huge diff → NEEDS_DISCUSSION).
  • Integration tests (review_pull_request produces markdown with the LLM summary and the deterministic risks both present).