Code Review Agents: Static Analysis Plus LLM Review in Practice
A code review Agent is not just "let the LLM see the diff." This article systematically explains the layered architecture (static analysis, pattern matching, LLM review, context-aware review), PR-Agent self-hosting, context design, false-positive suppression, and team collaboration models -- with precision metrics and adoption paths.
Traditional code review depends on senior engineers' time and attention, but the human bottleneck is physical -- a PR waits 1-2 days for review, and senior engineers spend 1-3 hours per day reviewing. LLMs have made "automated code review" real, but the naive "let the LLM see the diff" approach produces too many false positives, missed issues, and unverifiable suggestions. This article provides a production-engineering deep dive into the layered architecture, context design, false-positive suppression, and team collaboration patterns of code review Agents.
Capability Layers
Do not think of a "code review Agent" as a single model. It is a layered system where each layer handles a different class of check:
Layer 1: Static Analysis
- ESLint, Ruff, golangci-lint
- Type checking (mypy, TypeScript)
- Security scanning (CodeQL, Semgrep)
- Complexity detection (cyclomatic complexity)
Layer 2: Pattern Matching
- Business rules (internal lint rules)
- Historical bug patterns (pains already felt)
- API usage conventions (team coding standards)
Layer 3: LLM Review
- Design soundness
- Business logic correctness
- Readability and maintainability
- Test coverage
Layer 4: Context-aware Review
- Call chain analysis (who calls this, what it calls)
- Performance impact assessment
- Security context (auth, authorization, input validation)
Each layer has irreplaceable value but also clear limits. Trying to make the LLM replace all layers is a common trap -- LLMs are bad at exact pattern matching ("forgot to remove console.log"), and a few lines of static analysis rules handle that without any LLM.
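As a minimal sketch of that point, the check below scans only the added lines of a unified diff for leftover `console.log` calls using a regular expression. The function name `find_console_logs` and the sample diff are illustrative assumptions, not part of any tool named in this article; in a real JavaScript project an ESLint rule such as `no-console` would normally do this job.

```python
import re

# Matches a console.log( call anywhere in a line.
CONSOLE_LOG = re.compile(r"\bconsole\.log\s*\(")


def find_console_logs(diff_text: str) -> list[str]:
    """Return the added lines of a unified diff that contain console.log calls."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if CONSOLE_LOG.search(line):
                hits.append(line[1:].strip())
    return hits


# Hypothetical diff used only to exercise the function.
sample_diff = """\
--- a/app.js
+++ b/app.js
@@ -1,2 +1,3 @@
 function add(a, b) {
+  console.log("debug", a, b);
   return a + b;
"""

print(find_console_logs(sample_diff))
```

The check is deterministic, costs nothing per run, and never hallucinates, which is why this class of issue belongs in Layer 1 or Layer 2 rather than in the LLM prompt.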
Mainstream Tool Comparison
| Tool | Type | Capability | LLM support | Deployment |
|---|---|---|---|---|
| Sourcery | AI code review | Python deep | Yes (in-house) | SaaS |
| Qodo (CodiumAI) | AI code review | Multi-language | Yes | SaaS / self-hosted |
| Codeball | AI review | Multi-language | Yes | SaaS |
| Greptile | AI code review | Multi-language | Yes | SaaS |
| PR-Agent (Qodo) | Open-source PR Agent | Multi-language | Yes | Self-hosted |
| Codacy | Static plus AI | Multi-language | Partial | SaaS / self-hosted |
| SonarQube | Static analysis | Multi-language | No | Self-hosted |
Sourcery suits Python projects, with specific optimizations for Pandas and NumPy style. PR-Agent is open source and integrates with any Git platform or LLM. Codacy / SonarQube suit incremental adoption: static analysis first, then LLM.
PR-Agent Self-Hosted Setup
PR-Agent is a flexible open-source option that supports GitHub, GitLab, and Bitbucket:
pip install pr-agent
# .pr_agent.toml
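[github]
# Assumed key names: PR-Agent's GitHub provider reads a personal access token from `user_token`.
# Keep the real token in an untracked secrets file or an environment variable, not in a committed config.
deployment_type = "user"
user_token = "ghp_xxxxx"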
[config]
model = "gpt-4o"
custom_model = "openai/gpt-4o"
[pr_reviewer]
extra_instructions = """
- Pay attention to error handling: every external call should have a try/except
- Verify that the PR includes tests for new functionality
- Flag any direct database access in business logic (use repository pattern)
"""
PR-Agent offers four commands:
- /review: comprehensive review
- /describe: auto-generate PR description
- /improve: suggest code improvements
- /add_docs: auto-add docstrings
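# Follows the usage documented for the pip package; verify against your installed version.
from pr_agent import cli
from pr_agent.config_loader import get_settings

# Settings can also come from config files; they are set inline here for a one-off run.
get_settings().set("CONFIG.git_provider", "github")
get_settings().set("github.user_token", "ghp_xxxxx")
get_settings().set("openai.key", "sk-xxxxx")  # placeholder OpenAI key

# PR-Agent addresses a pull request by its URL rather than by repo name plus number.
pr_url = "https://github.com/myorg/myrepo/pull/123"
cli.run_command(pr_url, "/review")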
Context Design for LLM Review
The most common reason LLM review underperforms is not "the model is not good enough" but "context design is broken." A good review Agent's context should include:
from pathlib import Path


class ReviewContext:
    """Assembles the prompt context for one PR review."""

    def __init__(self, pr_diff, repo_path):
        # pr_diff is assumed to expose .modified_files and to render as diff text via str().
        self.pr_diff = pr_diff
        self.repo_path = Path(repo_path)

    def build_context(self) -> str:
        # The prompt text stays at column 0 so no stray indentation reaches the model.
        return f"""
## PR Diff
{self.pr_diff}
## Related files (full content for context)
{self._load_related_files()}
## Project conventions
{self._load_style_guide()}
## Recent changes to same area
{self._load_recent_history()}
## Architectural overview
{self._load_architecture_doc()}
## Review the PR for:
1. Correctness: does the code do what it claims?
2. Edge cases: what could go wrong?
3. Test coverage: are new functions tested?
4. Style: does it match project conventions?
5. Performance: any obvious bottlenecks?
6. Security: input validation, auth checks?
## Important:
- Be specific. Reference line numbers and existing code.
- Distinguish "must fix" (bug, security) from "nice to have" (style).
- If the code is fine, say so. Don't invent issues.
"""

    def _load_related_files(self) -> str:
        related = []
        for file in self.pr_diff.modified_files:
            full_path = self.repo_path / file
            if full_path.exists():
                related.append(f"### {file}\n```\n{full_path.read_text()}\n```")
        # Cap at five files to keep the prompt within budget.
        return "\n".join(related[:5])

    def _load_style_guide(self) -> str:
        guide_path = self.repo_path / "STYLE_GUIDE.md"
        if guide_path.exists():
            return guide_path.read_text()
        return ""

    def _load_recent_history(self) -> str:
        # Not shown in the original; placeholder so the class runs as written.
        return ""

    def _load_architecture_doc(self) -> str:
        # Not shown in the original; placeholder so the class runs as written.
        return ""
Key points of context design:
- The diff is the core, but the diff alone is not enough -- the model must also see the modified function's full code and its callers
- Load project conventions -- let the LLM know "this project uses the repository pattern, not direct SQL"
- Load architecture docs -- let the LLM know the overall design
- Categorize output explicitly -- "must fix" vs "nice to have", which prevents the LLM from marking everything as a blocker
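One way to supply "the modified function's full code" for Python sources is to map a changed line number back to its enclosing function with the standard `ast` module. The sketch below is an illustration under that assumption; the helper name `enclosing_function_source` is hypothetical, and caller lookup (the other half of the first point) would need a cross-file index that is not shown here.

```python
import ast
from typing import Optional


def enclosing_function_source(source: str, line_no: int) -> Optional[str]:
    """Return the source of the innermost function containing line_no (1-indexed)."""
    tree = ast.parse(source)
    best = None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.lineno <= line_no <= node.end_lineno:
                # Among nested candidates, the one that starts latest is innermost.
                if best is None or node.lineno > best.lineno:
                    best = node
    if best is None:
        return None  # the line is at module level
    return ast.get_source_segment(source, best)


# Hypothetical module used only to exercise the helper.
sample = """\
def outer(x):
    def inner(y):
        return y * 2
    return inner(x) + 1

def other():
    return 0
"""

print(enclosing_function_source(sample, 3))
```

Feeding the whole enclosing function instead of the whole file keeps the prompt small while still showing the model the code around each changed line.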
False Positive Suppression
The "noise problem" of LLM review is the biggest barrier to adoption. A 100-line PR flagged with 30 "issues" will exhaust the reviewer's patience, and they will eventually ignore every suggestion.
Engineering strategies to reduce false positives:
1. Explicit "do NOT" instructions in the system prompt:
Do NOT flag:
- Stylistic preferences not enforced by linters
- Theoretical issues that cannot occur given existing code
- Issues that were already discussed and decided
- "You could also do X" without a clear reason
2. Cap output length:
MAX_OUTPUT_TOKENS = 800
3. Ask the LLM for a confidence score:
prompt = """
For each issue you find, give a confidence score (0-100):
- 90-100: Definitely a bug, must fix
- 70-89: Likely an issue, should fix
- 50-69: Could be an issue, discuss with author
- Below 50: Not sure, don't report
"""
4. Collect review feedback as training data:
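class FeedbackCollector:
    def __init__(self, db):
        # db: a DB-API connection (for example sqlite3) with a review_feedback table.
        self.db = db

    def record_feedback(self, pr_id: str, issue_id: str, action: str):
        # action is the author's response, e.g. "agreed", "ignored", or "disagreed".
        self.db.execute(
            "INSERT INTO review_feedback (pr_id, issue_id, action) VALUES (?, ?, ?)",
            (pr_id, issue_id, action),
        )
        self.db.commit()

    def compute_precision(self, issue_type: str) -> float:
        ...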
5. Periodically review LLM suggestion quality:
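monthly_metrics = {
    "total_issues_reported": 0,
    "agreed_by_author": 0,
    "ignored": 0,
    "disagreed": 0,
}
# precision = agreed_by_author / total_issues_reported, guarded against an empty month
total = monthly_metrics["total_issues_reported"]
agreed = monthly_metrics["agreed_by_author"]
monthly_metrics["precision"] = agreed / total if total else 0.0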
If precision drops below 0.5, stop using LLM review or adjust the prompt.
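Strategies 3 and 5 can be wired together in a few lines. The sketch below assumes the LLM has been asked to return its findings as a JSON array whose items carry `msg` and `confidence` fields; that output shape, the function names, and the sample values are all illustrative assumptions rather than the format of any particular tool.

```python
import json

MIN_CONFIDENCE = 50     # mirrors "Below 50: don't report" in the confidence prompt
PRECISION_FLOOR = 0.5   # mirrors the pause threshold stated above


def filter_issues(raw_json: str, min_confidence: int = MIN_CONFIDENCE) -> list[dict]:
    """Keep only issues whose confidence is a number at or above the threshold."""
    kept = []
    for issue in json.loads(raw_json):
        confidence = issue.get("confidence")
        # Drop items with a missing or non-numeric score instead of guessing.
        if isinstance(confidence, (int, float)) and confidence >= min_confidence:
            kept.append(issue)
    # Highest confidence first so likely bugs appear at the top of the comment.
    return sorted(kept, key=lambda item: item["confidence"], reverse=True)


def should_pause(agreed: int, total: int, floor: float = PRECISION_FLOOR) -> bool:
    """Return True when measured precision has fallen below the floor."""
    if total == 0:
        return False  # no feedback yet, nothing to act on
    return agreed / total < floor


# Hypothetical model output used only to exercise the functions.
raw = json.dumps([
    {"msg": "possible None dereference", "confidence": 92},
    {"msg": "rename variable", "confidence": 30},
    {"msg": "missing timeout", "confidence": 75},
])

print([item["msg"] for item in filter_issues(raw)])
print(should_pause(agreed=12, total=30))
```

Filtering in code rather than trusting the model to withhold low-confidence items keeps the threshold adjustable per repository without touching the prompt.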
CI/CD Integration
The code review Agent should not be a standalone tool -- it should be embedded in the PR workflow:
# .github/workflows/pr-agent.yml
name: PR Agent
on:
  pull_request:
    types: [opened, synchronize]
permissions:
  contents: read
  pull-requests: write  # needed so the job can post review comments
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Run PR-Agent
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install pr-agent
          # Options go before the command name; anything after it is passed to the command.
          pr-agent --pr_url "${{ github.event.pull_request.html_url }}" review
Integration patterns:
- Auto-review on PR open: post LLM review as a PR comment
- Incremental review on PR update: only review modified lines
- Local pre-commit hook: developer sees LLM suggestions before pushing
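For the incremental pattern, the reviewer needs to know which lines of the new file were actually added. A minimal sketch, assuming a single-file unified diff as input, reads the hunk headers to recover those line numbers; the function name `added_line_numbers` and the sample diff are hypothetical.

```python
import re

# Captures the starting line of the new-file side of a hunk header, e.g. "@@ -10,2 +11,3 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_line_numbers(diff_text: str) -> list[int]:
    """Return new-file line numbers of the lines added in a single-file unified diff."""
    added = []
    new_line = None  # current position in the new file; None until the first hunk
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            new_line = int(match.group(1))
            continue
        if new_line is None:
            continue  # still inside the "---" / "+++" file header
        if line.startswith("+"):
            added.append(new_line)
            new_line += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removed lines and "\ No newline" markers are not in the new file
        else:
            new_line += 1  # context line, present in both versions
    return added


# Hypothetical diff used only to exercise the function.
sample_diff = """\
--- a/calc.py
+++ b/calc.py
@@ -1,2 +1,3 @@
 def add(a, b):
+    validate(a, b)
     return a + b
@@ -10,2 +11,3 @@
 def sub(a, b):
+    validate(a, b)
     return a - b
"""

print(added_line_numbers(sample_diff))
```

Comments from the LLM can then be discarded unless they point at one of these line numbers, which keeps a follow-up push from re-raising issues on untouched code.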
Team Collaboration Models
A code review Agent cannot work alone -- it must be part of the review flow:
Model 1: AI-first plus human spot-check
- Every PR goes through LLM review first
- Senior engineers only spot-check what the LLM flags as high-risk (must fix)
- Saves senior engineers 70% of review time
Model 2: AI plus double-blind human review
- LLM review plus randomly assigned human reviewer
- Both review independently; results are compared
- Suits the early phase, builds team trust in AI
Model 3: AI does format checks only
- LLM checks code style, naming, documentation
- Business logic and architecture still human-reviewed
- Suits teams skeptical of AI
Model 4: AI full review plus human approval
- LLM produces a detailed review
- Human reviewer just "approves" or "rejects"
- Suits small projects with rapid iteration
Case Studies
Case 1: A SaaS company
- 800 PRs per month
- After introducing PR-Agent, senior engineer review time dropped from 3 hours/day to 1 hour/day
- LLM review precision: 0.65 (roughly 2 of every 3 suggestions were accepted)
- Main false positive categories: style preferences, over-abstraction
Case 2: An open-source project
- Used PR-Agent to replace "newcomer-friendly review"
- LLM auto-gave contributors feedback (naming, error handling, tests)
- Maintainers only spent time on PRs the LLM marked as blockers
Case 3: A fintech
- Compliance required all PRs to go through LLM review plus human review
- LLM marked sensitive code (PII, encryption, amounts) into a compliance review queue
- Reduced compliance audit manual cost by 40%
Metrics
L1: LLM review itself
- Suggestion count, precision, recall
- Average review time
- Miss rate (what fraction of post-merge bugs were not flagged by the LLM)
L2: Process metrics
- Average review rounds per PR
- Time to merge
- Senior engineer review time share
L3: Quality metrics
- Post-merge bug count
- Regression test coverage
- Incident rate
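The L1 numbers reduce to simple ratios once the counts are collected. The sketch below computes precision and miss rate as defined above; the `ReviewCounts` fields and the sample counts are illustrative assumptions. Recall is left out because it needs a labeled set of all real issues, which feedback on posted suggestions alone does not provide.

```python
from dataclasses import dataclass


@dataclass
class ReviewCounts:
    reported: int            # suggestions the LLM posted
    accepted: int            # suggestions the author agreed with
    post_merge_bugs: int     # bugs found after merge in reviewed PRs
    post_merge_flagged: int  # of those bugs, how many the LLM had flagged


def safe_ratio(numerator: int, denominator: int) -> float:
    """Divide, returning 0.0 when there is no data yet."""
    return numerator / denominator if denominator else 0.0


def review_metrics(counts: ReviewCounts) -> dict[str, float]:
    """Compute precision and miss rate for the LLM review layer."""
    missed = counts.post_merge_bugs - counts.post_merge_flagged
    return {
        "precision": safe_ratio(counts.accepted, counts.reported),
        "miss_rate": safe_ratio(missed, counts.post_merge_bugs),
    }


# Made-up counts, used only to show the calculation.
example = ReviewCounts(reported=200, accepted=130, post_merge_bugs=20, post_merge_flagged=5)
print(review_metrics(example))
```

Tracking both numbers matters: precision measures noise in what the LLM says, while miss rate measures what it fails to say.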
Implementation Path
- Week 1: Pick a tool (Sourcery or PR-Agent), deploy to a private GitHub instance.
- Week 2: Pilot on 5 repos, encode project conventions in the system prompt.
- Week 3: Track precision metrics, collect review feedback.
- Week 4: Tune the prompt ("do not" list, project conventions).
- Week 5: Roll out company-wide, build the "AI-first plus human spot-check" workflow.
- Week 6: Establish monthly review; pause projects with precision < 0.5.
Summary
A code review Agent does not "replace human reviewers" -- it frees humans from mechanical format checks, style review, and basic bug detection so they can focus on architectural decisions and business correctness.
Adoption keys: start with static analysis (zero cost, immediate value), then add LLM review (improves speed), then add context-aware review (improves quality). Every step needs quantitative metrics; if precision falls short, pause.
Reference tools: Sourcery (AI code review for Python), PR-Agent (Qodo) (open-source PR Agent tooling), CodiumAI (multi-language AI code review), The PR-Agent (PR-Agent legacy version), and AlphaCodium (code generation plus review) cover the core tooling of the code review Agent stack.
Projects in this article
Sourcery
1.8k ⭐ Sourcery is an instant AI code review tool that automatically detects code issues, suggests refactoring, and improves code quality, integrating into developer workflows for real-time code review.
AlphaCodium
3.9k ⭐ CodiumAI's SOTA method on the CodeContest benchmark.
PR-Agent
11.9k ⭐ The original open-source AI PR reviewer. Automatically analyzes pull requests and generates code review feedback, improvement suggestions, and PR descriptions across GitHub, GitLab, and Bitbucket.
code-review-graph
19.0k ⭐ Local-first code intelligence graph for MCP and CLI. Builds a persistent map of the codebase so AI coding tools read only what matters, with benchmarked context reductions on code review and large-repo workflows.
LangChainGo
9.5k ⭐ LangChainGo is the Go implementation of LangChain, providing the easiest way to write LLM-based programs in Go with chains, agents, and tool integrations.