# Code Review Bot

A PR-triage agent that reads a unified diff, runs deterministic static checks
via tools, and produces a structured markdown review with a verdict of
`APPROVE` / `REQUEST_CHANGES` / `NEEDS_DISCUSSION`.
## Problem
Every engineering team wastes hours on PR reviews that are all mechanical: hardcoded secrets, missing tests, giant diffs, SQL built with string concatenation. These are not interesting things for a human to catch — but they slip through often enough to cause incidents.
This bot triages those mechanical issues automatically and leaves humans to argue about the interesting parts (architecture, naming, trade-offs). It produces a review comment that:
- Assigns one of three verdicts (`APPROVE`, `REQUEST_CHANGES`, `NEEDS_DISCUSSION`).
- Flags risks by category (security, tests, complexity, size).
- Emits inline comments sorted by severity, capped at a readable number.
- Appends a short reviewer-tone paragraph written by the LLM.
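The report objects referenced later in this README (`result.report.verdict`, `result.report.risks`, `RiskFlag`) have roughly this shape. This is a sketch: the field names beyond `verdict` and `risks`, and the `top_risks` helper, are assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Literal

Verdict = Literal["APPROVE", "REQUEST_CHANGES", "NEEDS_DISCUSSION"]

@dataclass
class RiskFlag:
    category: str   # "security" | "tests" | "complexity" | "size"
    severity: int   # higher = more serious; drives inline-comment ordering
    message: str

@dataclass
class ReviewReport:
    verdict: Verdict
    risks: list[RiskFlag] = field(default_factory=list)

    def top_risks(self, cap: int = 10) -> list[RiskFlag]:
        # Inline comments sorted by severity, capped at a readable number.
        return sorted(self.risks, key=lambda r: -r.severity)[:cap]
```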
## Architecture
```
diff text
    |
    v
parse_diff ---------------+
    |                     |
    v                     |
analyze_pr (reviewer.py)  |  deterministic tools:
    |                     |  - check_security_patterns
    v                     |  - check_test_coverage
classify (reviewer.py)    |  - compute_complexity
    |                     |  - find_related_code
    v                     |  - look_up_style_guide
LLM summary (agent.py) ---+
    |
    v
format_markdown -> GitHub comment
```
The deterministic layer does all the verdict-forming work. The LLM only writes the natural-language wrap-up paragraph. That split is deliberate:
- **Policy is auditable.** You can read `reviewer.py` top-to-bottom and know why a PR got `REQUEST_CHANGES`. You do not have to trust a prompt.
- **Cost stays low.** The model emits ~60 output tokens per review, not hundreds. Most of the work is regex and subprocess calls.
- **Tests are deterministic.** The verdict tests don't mock an LLM. They run real Python against real diffs.
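A sketch of what those deterministic policy tests look like. The real `classify` lives in `reviewer.py`; the stand-in below (and its single regex) is illustrative, not the project's actual rule set.

```python
import re

SECRET_DIFF = """\
+++ b/app/config.py
@@ -1,2 +1,3 @@
+AWS_SECRET_KEY = "AKIA..."
"""

def classify(diff_text: str) -> str:
    # Stand-in for reviewer.classify: any hardcoded-credential hit forces
    # REQUEST_CHANGES; everything else in this sketch approves.
    if re.search(r"(?i)(secret|api[_-]?key|token)\w*\s*=\s*['\"]", diff_text):
        return "REQUEST_CHANGES"
    return "APPROVE"

def test_secrets_request_changes():
    assert classify(SECRET_DIFF) == "REQUEST_CHANGES"

def test_clean_diff_approves():
    assert classify("+x = load_config()\n") == "APPROVE"
```

No mocks, no network: the tests feed a literal diff string through real Python and assert on the verdict.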
## Cost estimate
At default settings (Claude Haiku for the summary, ~250 input / 60 output tokens per review):
- Haiku 4.5: ~$0.0003 per PR.
- Sonnet 4.6: ~$0.005 per PR.
- Opus 4.7: ~$0.05 per PR (only if you want the nicest prose).
The sample diffs average 30-100 input lines. Real PRs average 200-600. You pay roughly in proportion to the diff size; the tools do not feed every line to the model.
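The arithmetic behind those figures is just tokens times price. A minimal helper, with the per-million-token prices treated as placeholders you substitute with your provider's current rates:

```python
def cost_per_pr(input_tokens: int, output_tokens: int,
                in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Dollar cost of one review, given token counts and $/MTok prices."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000

# With the README's ~250 input / 60 output tokens and an illustrative
# $1-in / $5-out per MTok price, one review costs:
#   cost_per_pr(250, 60, 1.0, 5.0) == 0.00055
```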
## Tools
| Tool | Purpose |
|---|---|
| `parse_diff` | Unified-diff parser; returns hunks with language tags. |
| `check_security_patterns` | Regex scan for hardcoded secrets, SQL/shell injection, weak hashes. |
| `check_test_coverage` | Did the diff touch source without touching tests? |
| `find_related_code` | Ripgrep over a repo path; useful for pulling in callers. |
| `look_up_style_guide` | Per-language rule list for anchoring style comments. |
| `compute_complexity` | Keyword-based cyclomatic-complexity estimate. |
Each tool returns a JSON string. The agent reads those strings as observations; the reviewer module interprets them into a verdict.
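A minimal sketch of that tool contract, in the shape of `check_security_patterns`: plain inputs in, a JSON *string* out, which the agent treats as an opaque observation. The two patterns below are illustrative, not the project's full rule list.

```python
import json
import re

PATTERNS = [
    (r"(?i)(api[_-]?key|secret|password|token)\w*\s*=\s*['\"][^'\"]+['\"]",
     "hardcoded secret"),
    (r"(?i)\bmd5\s*\(", "weak hash (MD5)"),
]

def check_security_patterns(diff_text: str) -> str:
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), 1):
        if not line.startswith("+"):        # only scan added lines
            continue
        for pattern, label in PATTERNS:
            if re.search(pattern, line):
                findings.append({"line": lineno, "issue": label})
    return json.dumps({"findings": findings})
```

The agent never parses these strings itself; the reviewer module decodes them and folds the findings into the verdict.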
## Running
Mock mode is the default when there's no `ANTHROPIC_API_KEY` in the env.

```bash
SWARM_MOCK=true python -m projects.code_review_bot.agent \
    projects/code_review_bot/sample_diffs/security_issue.diff
```
You can also use it programmatically:
```python
from projects.code_review_bot.agent import review_pull_request

result = await review_pull_request(diff_text)
print(result.markdown)
print(result.report.verdict)  # "REQUEST_CHANGES"
print(result.report.risks)    # [RiskFlag(...)]
```
## Integrating with GitHub (production)
We do not ship a webhook listener — the bot is pipeline-agnostic. Wire it like this:
- **Webhook receiver.** A small FastAPI service that subscribes to GitHub's `pull_request` events (opened, synchronized, reopened). Extract the PR number and the head SHA.
- **Diff fetch.** `GET /repos/{owner}/{repo}/pulls/{number}` with header `Accept: application/vnd.github.v3.diff` returns the raw unified diff.
- **Review.** `await review_pull_request(diff_text)`.
- **Post.** If `result.report.verdict != "APPROVE"`, post `result.markdown` as a PR comment via `POST /repos/{owner}/{repo}/issues/{number}/comments`. For `APPROVE`, consider submitting a review with `event=APPROVE` via the pull-request review API instead.
- **Cache by SHA.** The bot is deterministic for a given diff; cache the result keyed by head SHA so re-runs don't charge you twice.
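The fetch and post steps can be sketched with the stdlib alone. The endpoints and the `Accept` media type come from the steps above; the helper names, the `GITHUB_TOKEN` env variable, and the use of `urllib` rather than any particular HTTP client are our assumptions.

```python
import json
import os
import urllib.request
from typing import Optional

API = "https://api.github.com"

def _request(url: str, accept: str,
             data: Optional[bytes] = None) -> urllib.request.Request:
    req = urllib.request.Request(url, data=data,
                                 method="POST" if data else "GET")
    req.add_header("Authorization",
                   f"Bearer {os.environ.get('GITHUB_TOKEN', '')}")
    req.add_header("Accept", accept)
    return req

def diff_request(owner: str, repo: str, number: int) -> urllib.request.Request:
    # The diff media type switches the PR endpoint to raw unified-diff output.
    return _request(f"{API}/repos/{owner}/{repo}/pulls/{number}",
                    "application/vnd.github.v3.diff")

def comment_request(owner: str, repo: str, number: int,
                    markdown: str) -> urllib.request.Request:
    body = json.dumps({"body": markdown}).encode()
    return _request(f"{API}/repos/{owner}/{repo}/issues/{number}/comments",
                    "application/vnd.github+json", data=body)
```

`urllib.request.urlopen(diff_request(...))` fetches the diff to feed `review_pull_request`; `urlopen(comment_request(...))` posts the resulting markdown.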
Rate limits: GitHub allows ~5,000 API calls per hour per installation. At one review per PR and a healthy team, you will not come close.
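One way to implement the cache-by-SHA step is a tiny on-disk cache keyed by the PR's head SHA. The file layout and function name below are assumptions, not something the bot ships:

```python
import json
from pathlib import Path

CACHE_DIR = Path(".review_cache")

def cached_review(head_sha: str, run_review):
    """Return a cached review result for this head SHA, or compute and store one.

    run_review is a zero-argument callable returning a JSON-serializable dict
    (e.g. {"verdict": ..., "markdown": ...}).
    """
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{head_sha}.json"
    if path.exists():
        return json.loads(path.read_text())  # re-run: no LLM call, no charge
    result = run_review()
    path.write_text(json.dumps(result))
    return result
```

New commits get a new head SHA, so a force-push or follow-up commit naturally misses the cache and triggers a fresh review.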
## Tests
The test suite covers:
- Tool unit tests (each regex catches what it should, passes what it shouldn't).
- Reviewer policy tests (clean → `APPROVE`, secrets → `REQUEST_CHANGES`, missing tests → `NEEDS_DISCUSSION`, huge diff → `NEEDS_DISCUSSION`).
- Integration tests (`review_pull_request` produces markdown with the LLM summary and the deterministic risks both present).