# Code Review Bot

A PR-triage agent that reads a unified diff, runs deterministic static checks
via tools, and produces a structured markdown review with a verdict of
`APPROVE` / `REQUEST_CHANGES` / `NEEDS_DISCUSSION`.
## Problem
Every engineering team wastes hours on PR reviews that are all mechanical: hardcoded secrets, missing tests, giant diffs, SQL built with string concatenation. These are not interesting things for a human to catch — but they slip through often enough to cause incidents.
This bot triages those mechanical issues automatically and leaves humans to argue about the interesting parts (architecture, naming, trade-offs). It produces a review comment that:
- Assigns one of three verdicts (`APPROVE`, `REQUEST_CHANGES`, `NEEDS_DISCUSSION`).
- Flags risks by category (security, tests, complexity, size).
- Emits inline comments sorted by severity, capped at a readable number.
- Appends a short reviewer-tone paragraph written by the LLM.
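The report objects referenced later in this README (`result.report.verdict`, `result.report.risks`, `RiskFlag`) have roughly this shape. This is a sketch: the field names beyond `verdict` and `risks`, and the `top_risks` helper, are assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Literal

Verdict = Literal["APPROVE", "REQUEST_CHANGES", "NEEDS_DISCUSSION"]

@dataclass
class RiskFlag:
    category: str   # "security" | "tests" | "complexity" | "size"
    severity: int   # higher = more serious; drives inline-comment ordering
    message: str

@dataclass
class ReviewReport:
    verdict: Verdict
    risks: list[RiskFlag] = field(default_factory=list)

    def top_risks(self, cap: int = 10) -> list[RiskFlag]:
        # Inline comments sorted by severity, capped at a readable number.
        return sorted(self.risks, key=lambda r: -r.severity)[:cap]
```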
## Architecture
```
diff text
    |
    v
parse_diff ---------------+
    |                     |
    v                     |
analyze_pr (reviewer.py)  |  deterministic tools:
    |                     |  - check_security_patterns
    v                     |  - check_test_coverage
classify (reviewer.py)    |  - compute_complexity
    |                     |  - find_related_code
    v                     |  - look_up_style_guide
LLM summary (agent.py) ---+
    |
    v
format_markdown -> GitHub comment
```
The deterministic layer does all the verdict-forming work. The LLM only writes the natural-language wrap-up paragraph. That split is deliberate:
- **Policy is auditable.** You can read `reviewer.py` top-to-bottom and know why a PR got `REQUEST_CHANGES`. You do not have to trust a prompt.
- **Cost stays low.** The model emits ~60 output tokens per review, not hundreds. Most of the work is regex and subprocess calls.
- **Tests are deterministic.** The verdict tests don't mock an LLM. They run real Python against real diffs.
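A sketch of what those deterministic policy tests look like. The real `classify` lives in `reviewer.py`; the stand-in below (and its single regex) is illustrative, not the project's actual rule set.

```python
import re

SECRET_DIFF = """\
+++ b/app/config.py
@@ -1,2 +1,3 @@
+AWS_SECRET_KEY = "AKIA..."
"""

def classify(diff_text: str) -> str:
    # Stand-in for reviewer.classify: any hardcoded-credential hit forces
    # REQUEST_CHANGES; everything else in this sketch approves.
    if re.search(r"(?i)(secret|api[_-]?key|token)\w*\s*=\s*['\"]", diff_text):
        return "REQUEST_CHANGES"
    return "APPROVE"

def test_secrets_request_changes():
    assert classify(SECRET_DIFF) == "REQUEST_CHANGES"

def test_clean_diff_approves():
    assert classify("+x = load_config()\n") == "APPROVE"
```

No mocks, no network: the tests feed a literal diff string through real Python and assert on the verdict.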
## Cost estimate
At default settings (Claude Haiku for the summary, ~250 input / 60 output tokens per review):
- Haiku 4.5: ~$0.0003 per PR.
- Sonnet 4.6: ~$0.005 per PR.
- Opus 4.7: ~$0.05 per PR (only if you want the nicest prose).
The sample diffs average 30-100 input lines. Real PRs average 200-600. You pay roughly in proportion to the diff size; the tools do not feed every line to the model.
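The arithmetic behind those figures is just tokens times price. A minimal helper, with the per-million-token prices treated as placeholders you substitute with your provider's current rates:

```python
def cost_per_pr(input_tokens: int, output_tokens: int,
                in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Dollar cost of one review, given token counts and $/MTok prices."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000

# With the README's ~250 input / 60 output tokens and an illustrative
# $1-in / $5-out per MTok price, one review costs:
#   cost_per_pr(250, 60, 1.0, 5.0) == 0.00055
```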
## Tools
| Tool | Purpose |
|---|---|
| `parse_diff` | Unified-diff parser; returns hunks with language tags. |
| `check_security_patterns` | Regex scan for hardcoded secrets, SQL/shell injection, weak hashes. |
| `check_test_coverage` | Did the diff touch source without touching tests? |
| `find_related_code` | Ripgrep over a repo path; useful for pulling in callers. |
| `look_up_style_guide` | Per-language rule list for anchoring style comments. |
| `compute_complexity` | Keyword-based cyclomatic-complexity estimate. |
Each tool returns a JSON string. The agent reads those strings as observations; the reviewer module interprets them into a verdict.
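A minimal sketch of that tool contract, in the shape of `check_security_patterns`: plain inputs in, a JSON *string* out, which the agent treats as an opaque observation. The two patterns below are illustrative, not the project's full rule list.

```python
import json
import re

PATTERNS = [
    (r"(?i)(api[_-]?key|secret|password|token)\w*\s*=\s*['\"][^'\"]+['\"]",
     "hardcoded secret"),
    (r"(?i)\bmd5\s*\(", "weak hash (MD5)"),
]

def check_security_patterns(diff_text: str) -> str:
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), 1):
        if not line.startswith("+"):        # only scan added lines
            continue
        for pattern, label in PATTERNS:
            if re.search(pattern, line):
                findings.append({"line": lineno, "issue": label})
    return json.dumps({"findings": findings})
```

The agent never parses these strings itself; the reviewer module decodes them and folds the findings into the verdict.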
## Running
Mock mode is the default when there's no `ANTHROPIC_API_KEY` in the env.

```bash
SWARM_MOCK=true python -m projects.code_review_bot.agent \
    projects/code_review_bot/sample_diffs/security_issue.diff
```
You can also use it programmatically:
```python
from projects.code_review_bot.agent import review_pull_request

result = await review_pull_request(diff_text)
print(result.markdown)
print(result.report.verdict)  # "REQUEST_CHANGES"
print(result.report.risks)    # [RiskFlag(...)]
```
## Integrating with GitHub (production)
We do not ship a webhook listener — the bot is pipeline-agnostic. Wire it like this:
- **Webhook receiver.** A small FastAPI service that subscribes to GitHub's `pull_request` events (opened, synchronized, reopened). Extract the PR number and the head SHA.
- **Diff fetch.** `GET /repos/{owner}/{repo}/pulls/{number}` with header `Accept: application/vnd.github.v3.diff` returns the raw unified diff.
- **Review.** `await review_pull_request(diff_text)`.
- **Post.** If `result.report.verdict != "APPROVE"`, post `result.markdown` as a PR comment via `POST /repos/{owner}/{repo}/issues/{number}/comments`. For `APPROVE`, consider submitting a review with `event=APPROVE` via the pull-request review API instead.
- **Cache by SHA.** The bot is deterministic for a given diff; cache the result keyed by head SHA so re-runs don't charge you twice.
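The fetch and post steps can be sketched with the stdlib alone. The endpoints and the `Accept` media type come from the steps above; the helper names, the `GITHUB_TOKEN` env variable, and the use of `urllib` rather than any particular HTTP client are our assumptions.

```python
import json
import os
import urllib.request
from typing import Optional

API = "https://api.github.com"

def _request(url: str, accept: str,
             data: Optional[bytes] = None) -> urllib.request.Request:
    req = urllib.request.Request(url, data=data,
                                 method="POST" if data else "GET")
    req.add_header("Authorization",
                   f"Bearer {os.environ.get('GITHUB_TOKEN', '')}")
    req.add_header("Accept", accept)
    return req

def diff_request(owner: str, repo: str, number: int) -> urllib.request.Request:
    # The diff media type switches the PR endpoint to raw unified-diff output.
    return _request(f"{API}/repos/{owner}/{repo}/pulls/{number}",
                    "application/vnd.github.v3.diff")

def comment_request(owner: str, repo: str, number: int,
                    markdown: str) -> urllib.request.Request:
    body = json.dumps({"body": markdown}).encode()
    return _request(f"{API}/repos/{owner}/{repo}/issues/{number}/comments",
                    "application/vnd.github+json", data=body)
```

`urllib.request.urlopen(diff_request(...))` fetches the diff to feed `review_pull_request`; `urlopen(comment_request(...))` posts the resulting markdown.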
Rate limits: GitHub allows ~5,000 API calls per hour per installation. At one review per PR and a healthy team, you will not come close.
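One way to implement the cache-by-SHA step is a tiny on-disk cache keyed by the PR's head SHA. The file layout and function name below are assumptions, not something the bot ships:

```python
import json
from pathlib import Path

CACHE_DIR = Path(".review_cache")

def cached_review(head_sha: str, run_review):
    """Return a cached review result for this head SHA, or compute and store one.

    run_review is a zero-argument callable returning a JSON-serializable dict
    (e.g. {"verdict": ..., "markdown": ...}).
    """
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{head_sha}.json"
    if path.exists():
        return json.loads(path.read_text())  # re-run: no LLM call, no charge
    result = run_review()
    path.write_text(json.dumps(result))
    return result
```

New commits get a new head SHA, so a force-push or follow-up commit naturally misses the cache and triggers a fresh review.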
## Tests
The test suite covers:
- Tool unit tests (each regex catches what it should, passes what it shouldn't).
- Reviewer policy tests (clean → `APPROVE`, secrets → `REQUEST_CHANGES`, missing tests → `NEEDS_DISCUSSION`, huge diff → `NEEDS_DISCUSSION`).
- Integration tests (`review_pull_request` produces markdown with the LLM summary and the deterministic risks both present).