Code Review Agents: Static Analysis Plus LLM Review in Practice

Traditional code review depends on senior engineers' time and attention, and that human bottleneck is hard to remove: a PR typically waits 1-2 days for review, and senior engineers spend 1-3 hours per day reviewing. LLMs have made automated code review practical, but the naive approach of simply showing the LLM the diff produces too many false positives, missed issues, and unverifiable suggestions. This article takes a production-engineering look at the layered architecture, context design, false-positive suppression, and team collaboration patterns of code review Agents.

Capability Layers

Do not think of a "code review Agent" as a single model. It is a layered system where each layer handles a different class of check:

Layer 1: Static Analysis

ESLint, Ruff, golangci-lint
Type checking (mypy, TypeScript)
Security scanning (CodeQL, Semgrep)
Complexity detection (cyclomatic complexity)

Layer 2: Pattern Matching

Business rules (internal lint rules)
Historical bug patterns (bugs the team has already been bitten by)
API usage conventions (team coding standards)

Layer 3: LLM Review

Design soundness
Business logic correctness
Readability and maintainability
Test coverage

Layer 4: Context-aware Review

Call chain analysis (who calls this, what it calls)
Performance impact assessment
Security context (auth, authorization, input validation)

Each layer has irreplaceable value but also clear limits. Trying to make the LLM replace all layers is a common trap -- LLMs are bad at exact pattern matching ("forgot to remove console.log"), and a few lines of static analysis rules handle that without any LLM.
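
As a rough illustration of why exact-match checks belong in Layer 1, the sketch below flags leftover console.log calls in the added lines of a diff with a single regular expression and no LLM call. The function name find_debug_logs and the sample diff are hypothetical, and a real project would normally express this as an ESLint rule instead.

```python
import re

# Matches a console.log( call anywhere on the line.
CONSOLE_LOG = re.compile(r"\bconsole\.log\s*\(")


def find_debug_logs(diff_text: str) -> list[str]:
    """Return added lines (without the leading '+') that contain console.log."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with '+'; '+++' is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if CONSOLE_LOG.search(line):
                hits.append(line[1:].strip())
    return hits


sample_diff = """\
--- a/app.js
+++ b/app.js
@@ -1,2 +1,3 @@
 function total(items) {
+  console.log("debug", items);
   return items.length;
"""

print(find_debug_logs(sample_diff))
```

Running cheap deterministic checks like this first also keeps the LLM's output focused on issues that actually need judgment.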

Mainstream Tool Comparison

Tool	Type	Language coverage	LLM support	Deployment
Sourcery	AI code review	Python (deep)	Yes (in-house)	SaaS
Qodo (CodiumAI)	AI code review	Multi-language	Yes	SaaS / self-hosted
Codeball	AI review	Multi-language	Yes	SaaS
Greptile	AI code review	Multi-language	Yes	SaaS
PR-Agent (Qodo)	Open-source PR Agent	Multi-language	Yes	Self-hosted
Codacy	Static plus AI	Multi-language	Partial	SaaS / self-hosted
SonarQube	Static analysis	Multi-language	No	Self-hosted

Sourcery suits Python projects, with specific optimizations for Pandas and NumPy style. PR-Agent is open source and integrates with any Git platform or LLM. Codacy / SonarQube suit incremental adoption: static analysis first, then LLM.

PR-Agent Self-Hosted Setup

PR-Agent is currently the most flexible open-source solution, supporting GitHub, GitLab, and Bitbucket:

pip install pr-agent

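# .pr_agent.toml
# Key names follow this article's example; verify them against the
# PR-Agent configuration docs for your installed version.
[github]
user = "your-bot-user"
# Placeholder only: keep real tokens in a secrets store or environment
# variable, never in a committed config file.
token = "ghp_xxxxx"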

[config]
model = "gpt-4o"
custom_model = "openai/gpt-4o"

[pr_reviewer]
extra_instructions = """
- Pay attention to error handling: every external call should have a try/except
- Verify that the PR includes tests for new functionality
- Flag any direct database access in business logic (use repository pattern)
"""

PR-Agent's commands include:

/review: comprehensive review
/describe: auto-generate PR description
/improve: suggest code improvements
/add_docs: auto-add docstrings

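A review can also be triggered from Python. The snippet below is illustrative: the import path and run() signature are simplified, so check the PR-Agent documentation for the entry point your installed version exposes.

from pr_agent import PRAgent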

agent = PRAgent()
result = agent.run(
    command="review",
    repo="myorg/myrepo",
    pr_number=123,
)
print(result)

Context Design for LLM Review

The most common reason LLM review underperforms is not "the model is not good enough" but "context design is broken." A good review Agent's context should include:

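import subprocess
from pathlib import Path


class ReviewContext:
    def __init__(self, pr_diff, repo_path):
        # pr_diff is assumed to be a diff object whose str() is the unified
        # diff text and which exposes .modified_files (repo-relative paths).
        self.pr_diff = pr_diff
        self.repo_path = Path(repo_path)  # accept either str or Path

    def build_context(self) -> str:
        return f"""
## PR Diff
{self.pr_diff}

## Related files (full content for context)
{self._load_related_files()}

## Project conventions
{self._load_style_guide()}

## Recent changes to same area
{self._load_recent_history()}

## Architectural overview
{self._load_architecture_doc()}

## Review the PR for:
1. Correctness: does the code do what it claims?
2. Edge cases: what could go wrong?
3. Test coverage: are new functions tested?
4. Style: does it match project conventions?
5. Performance: any obvious bottlenecks?
6. Security: input validation, auth checks?

## Important:
- Be specific. Reference line numbers and existing code.
- Distinguish "must fix" (bug, security) from "nice to have" (style).
- If the code is fine, say so. Don't invent issues.
"""

    def _read_optional(self, relative_path: str) -> str:
        # Missing docs are common; return an empty section instead of failing.
        path = self.repo_path / relative_path
        return path.read_text() if path.is_file() else ""

    def _load_related_files(self) -> str:
        related = []
        for file in self.pr_diff.modified_files:
            full_path = self.repo_path / file
            # Deleted files appear in the diff but no longer exist on disk.
            if full_path.is_file():
                related.append(f"### {file}\n```\n{full_path.read_text()}\n```")
        # Cap at five files to keep the prompt within the context window.
        return "\n".join(related[:5])

    def _load_style_guide(self) -> str:
        return self._read_optional("STYLE_GUIDE.md")

    def _load_architecture_doc(self) -> str:
        # Assumed location; point this at wherever the project keeps its overview.
        return self._read_optional("ARCHITECTURE.md")

    def _load_recent_history(self) -> str:
        # Assumption: the last ten commits touching the modified files are enough.
        result = subprocess.run(
            ["git", "log", "--oneline", "-n", "10", "--", *self.pr_diff.modified_files],
            cwd=self.repo_path, capture_output=True, text=True, check=False,
        )
        return result.stdout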

Key points of context design:

The diff is the core, but the diff alone is not enough: the LLM must also see the modified function's full code and its callers
Load project conventions so the LLM knows, for example, "this project uses the repository pattern, not direct SQL"
Load architecture docs so the LLM knows the overall design
Categorize output explicitly as "must fix" vs "nice to have", which prevents the LLM from marking everything as a blocker
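
The first point mentions callers, which are usually not in the diff at all. One minimal way to collect them for Python code is to walk each file's syntax tree and record calls by name, as sketched below. The helper find_callers and the two sample files are hypothetical, and matching by name alone is an assumption: it does not resolve imports, so unrelated functions with the same name will also match.

```python
import ast


def find_callers(sources: dict[str, str], func_name: str) -> list[tuple[str, int]]:
    """Return (filename, line number) for each call to func_name."""
    callers = []
    for filename, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if not isinstance(node, ast.Call):
                continue
            target = node.func
            # Plain call: charge(...); attribute call: billing.charge(...)
            if isinstance(target, ast.Name):
                name = target.id
            else:
                name = getattr(target, "attr", None)
            if name == func_name:
                callers.append((filename, node.lineno))
    return sorted(callers)


sources = {
    "billing.py": "def charge(user, amount):\n    return amount\n",
    "checkout.py": "import billing\n\ndef pay(user):\n    return billing.charge(user, 10)\n",
}
print(find_callers(sources, "charge"))
```

The call sites found this way can be added to the prompt next to the modified function, so the LLM can judge whether a signature or behavior change breaks existing callers.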

False Positive Suppression

The "noise problem" of LLM review is the biggest barrier to adoption. A 100-line PR flagged with 30 "issues" will exhaust the reviewer's patience, and they will eventually ignore every suggestion.

Engineering strategies to reduce false positives:

1. Explicit "do NOT" instructions in the system prompt:

Do NOT flag:
- Stylistic preferences not enforced by linters
- Theoretical issues that cannot occur given existing code
- Issues that were already discussed and decided
- "You could also do X" without a clear reason

2. Cap output length:

MAX_OUTPUT_TOKENS = 800

3. Ask the LLM for a confidence score:

prompt = """
For each issue you find, give a confidence score (0-100):
- 90-100: Definitely a bug, must fix
- 70-89: Likely an issue, should fix
- 50-69: Could be an issue, discuss with author
- Below 50: Not sure, don't report
"""
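
The confidence bands only help if something enforces them after the model responds. The sketch below assumes the LLM is asked to return a JSON array of objects with "message" and "confidence" fields; that output format, and the names filter_issues and label_for, are hypothetical. Issues below 50 are dropped and the rest are labeled by band.

```python
import json

REPORT_THRESHOLD = 50  # "Below 50: Not sure, don't report"


def label_for(confidence: int) -> str:
    """Map a 0-100 confidence score to the bands used in the prompt."""
    if confidence >= 90:
        return "must fix"
    if confidence >= 70:
        return "should fix"
    return "discuss"


def filter_issues(raw_json: str) -> list[dict]:
    """Drop low-confidence issues and tag the rest with a band label."""
    kept = []
    for issue in json.loads(raw_json):
        confidence = int(issue.get("confidence", 0))
        if confidence >= REPORT_THRESHOLD:
            kept.append({**issue, "confidence": confidence, "label": label_for(confidence)})
    # Highest confidence first so reviewers see likely blockers at the top.
    return sorted(kept, key=lambda item: item["confidence"], reverse=True)


raw = json.dumps([
    {"message": "Null dereference in parse()", "confidence": 95},
    {"message": "Consider renaming x", "confidence": 30},
    {"message": "Missing timeout on HTTP call", "confidence": 72},
])
for issue in filter_issues(raw):
    print(issue["label"], "-", issue["message"])
```

Self-reported confidence is not calibrated, so the threshold is best tuned against the feedback data described in the next step.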

4. Collect review feedback as training data:

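import sqlite3


class FeedbackCollector:
    def __init__(self, db: sqlite3.Connection):
        # Assumes a review_feedback(pr_id, issue_id, action) table already exists.
        self.db = db

    def record_feedback(self, pr_id: str, issue_id: str, action: str):
        self.db.execute(
            "INSERT INTO review_feedback (pr_id, issue_id, action) VALUES (?, ?, ?)",
            (pr_id, issue_id, action)
        )
        self.db.commit()  # persist each feedback event immediately

    def compute_precision(self, issue_type: str) -> float:
        ...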
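
One way the recorded feedback could be turned into a precision figure is sketched below, using an in-memory SQLite table with the same three columns. The action values "agreed", "ignored", and "disagreed" and the function name precision_from_feedback are assumptions for illustration. The table has no issue-type column, so this computes overall precision only.

```python
import sqlite3


def precision_from_feedback(db: sqlite3.Connection) -> float:
    """Share of recorded suggestions whose action is 'agreed'."""
    total, agreed = db.execute(
        "SELECT COUNT(*), COALESCE(SUM(action = 'agreed'), 0) FROM review_feedback"
    ).fetchone()
    # Guard against an empty table to avoid division by zero.
    return agreed / total if total else 0.0


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE review_feedback (pr_id TEXT, issue_id TEXT, action TEXT)")
db.executemany(
    "INSERT INTO review_feedback (pr_id, issue_id, action) VALUES (?, ?, ?)",
    [
        ("101", "a", "agreed"),
        ("101", "b", "ignored"),
        ("102", "c", "agreed"),
        ("102", "d", "disagreed"),
    ],
)
print(precision_from_feedback(db))  # 2 agreed out of 4 recorded -> 0.5
```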

5. Periodically review LLM suggestion quality:

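monthly_metrics = {
    "total_issues_reported": 0,
    "agreed_by_author": 0,
    "ignored": 0,
    "disagreed": 0,
}
# Precision = suggestions the author agreed with / all reported suggestions.
total = monthly_metrics["total_issues_reported"]
agreed = monthly_metrics["agreed_by_author"]
monthly_metrics["precision"] = agreed / total if total else 0.0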

If precision drops below 0.5, stop using LLM review or adjust the prompt.

CI/CD Integration

The code review Agent should not be a standalone tool -- it should be embedded in the PR workflow:

# .github/workflows/pr-agent.yml
name: PR Agent
on:
  pull_request:
    types: [opened, synchronize]

jobs:
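  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      issues: write
      pull-requests: write  # needed to post review comments on the PR
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Run PR-Agent
        env:
          # Variable names follow this article's example; check the PR-Agent
          # docs for the names your installed version actually reads.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install pr-agent
          # PR-Agent's documented CLI form puts --pr_url before the command.
          pr-agent --pr_url "${{ github.event.pull_request.html_url }}" review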

Integration patterns:

Auto-review on PR open: post LLM review as a PR comment
Incremental review on PR update: only review modified lines
Local pre-commit hook: developer sees LLM suggestions before pushing
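
Incremental review needs to know which lines of the new file a push actually added. A minimal sketch of that step, assuming a standard unified diff as input, is shown below. The function name added_lines is hypothetical, and the parser ignores edge cases such as renames and added lines whose own content starts with "++".

```python
import re

# Captures the starting line number on the new-file side of a hunk header.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(diff_text: str) -> dict[str, list[int]]:
    """Map each file in a unified diff to the new-file line numbers it adds."""
    result: dict[str, list[int]] = {}
    current_file = None
    new_line_no = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # Header looks like '+++ b/path/to/file'.
            current_file = line[4:].removeprefix("b/")
            result[current_file] = []
        elif line.startswith("--- "):
            continue  # old-file header
        elif (match := HUNK_HEADER.match(line)):
            new_line_no = int(match.group(1))
        elif current_file is None:
            continue  # preamble before the first file header
        elif line.startswith("+"):
            result[current_file].append(new_line_no)
            new_line_no += 1
        elif line.startswith("-"):
            continue  # removed lines do not exist in the new file
        else:
            new_line_no += 1  # context line, present in the new file
    return result


sample = """\
--- a/app.py
+++ b/app.py
@@ -1,2 +1,3 @@
 import os
+import sys
 def main():
@@ -10,2 +11,3 @@
     run()
+    cleanup()
     return 0
"""
print(added_lines(sample))
```

Review comments can then be restricted to these line numbers, so a follow-up push does not re-raise issues on code the author did not touch.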

Team Collaboration Models

A code review Agent cannot work alone -- it must be part of the review flow:

Model 1: AI-first plus human spot-check

Every PR goes through LLM review first
Senior engineers only spot-check what the LLM flags as high-risk (must fix)
Can save senior engineers roughly 70% of review time

Model 2: AI plus double-blind human review

LLM review plus randomly assigned human reviewer
Both review independently; results are compared
Suits the early phase, builds team trust in AI

Model 3: AI does format checks only

LLM checks code style, naming, documentation
Business logic and architecture still human-reviewed
Suits teams skeptical of AI

Model 4: AI full review plus human approval

LLM produces a detailed review
Human reviewer just "approves" or "rejects"
Suits small projects with rapid iteration
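
For Model 2, "results are compared" can be made concrete with a simple overlap measure. The sketch below assumes each finding is keyed by a "file:line" string, which is a hypothetical convention, and reports how many findings both reviewers raised, how many only one side raised, and the Jaccard overlap between the two sets.

```python
def compare_reviews(llm: set[str], human: set[str]) -> dict[str, float]:
    """Compare two independent reviews of the same PR."""
    both = llm & human
    union = llm | human
    return {
        "both": len(both),
        "llm_only": len(llm - human),
        "human_only": len(human - llm),
        # Jaccard overlap: shared findings / all distinct findings.
        "overlap": len(both) / len(union) if union else 1.0,
    }


llm_findings = {"api.py:12", "api.py:40", "db.py:7"}
human_findings = {"api.py:12", "db.py:7", "ui.py:3"}
print(compare_reviews(llm_findings, human_findings))
```

Tracking the "human_only" count over time shows what the LLM keeps missing, while "llm_only" findings are candidates for either real catches or false positives.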

Case Studies

Case 1: A SaaS company

800 PRs per month
After introducing PR-Agent, senior engineer review time dropped from 3 hours/day to 1 hour/day
LLM review precision: 0.65 (roughly 2 of every 3 suggestions were accepted)
Main false positive categories: style preferences, over-abstraction

Case 2: An open-source project

Used PR-Agent to replace "newcomer-friendly review"
LLM auto-gave contributors feedback (naming, error handling, tests)
Maintainers only spent time on PRs the LLM marked as blockers

Case 3: A fintech

Compliance required all PRs to go through LLM review plus human review
The LLM routed sensitive code (PII, encryption, monetary amounts) into a compliance review queue
Reduced compliance audit manual cost by 40%

Metrics

L1: LLM review itself

Suggestion count, precision, recall
Average review time
Miss rate (what fraction of post-merge bugs were not flagged by the LLM)

L2: Process metrics

Average review rounds per PR
Time to merge
Senior engineer review time share

L3: Quality metrics

Post-merge bug count
Regression test coverage
Incident rate
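
The two ratios in the first group can be written down explicitly. The sketch below computes precision from suggestion counts and miss rate from post-merge bugs. The function name, parameter names, and sample numbers are hypothetical, and treating post-merge bugs as the ground truth for misses is an assumption, since bugs nobody has found yet are not counted.

```python
def review_metrics(
    reported: int,
    accepted: int,
    post_merge_bugs: int,
    flagged_before_merge: int,
) -> dict[str, float]:
    """Precision of LLM suggestions and miss rate over post-merge bugs."""
    # Precision: accepted suggestions / all reported suggestions.
    precision = accepted / reported if reported else 0.0
    # Miss rate: post-merge bugs the LLM never flagged / all post-merge bugs.
    missed = post_merge_bugs - flagged_before_merge
    miss_rate = missed / post_merge_bugs if post_merge_bugs else 0.0
    return {"precision": precision, "miss_rate": miss_rate}


# Made-up month: 200 suggestions with 130 accepted; 16 post-merge bugs, 4 of them flagged.
print(review_metrics(reported=200, accepted=130, post_merge_bugs=16, flagged_before_merge=4))
```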

Implementation Path

Week 1: Pick a tool (Sourcery or PR-Agent), deploy to a private GitHub instance.
Week 2: Pilot on 5 repos, encode project conventions in the system prompt.
Week 3: Track precision metrics, collect review feedback.
Week 4: Tune the prompt ("do not" list, project conventions).
Week 5: Roll out company-wide, build the "AI-first plus human spot-check" workflow.
Week 6: Establish monthly review; pause projects with precision < 0.5.

Summary

A code review Agent does not "replace human reviewers" -- it frees humans from mechanical format checks, style review, and basic bug detection so they can focus on architectural decisions and business correctness.

Adoption keys: start with static analysis (zero cost, immediate value), then add LLM review (improves speed), then add context-aware review (improves quality). Every step needs quantitative metrics; if precision falls short, pause.

Reference tools: Sourcery (AI code review for Python), PR-Agent (open-source PR Agent tooling from Qodo, formerly CodiumAI), and AlphaCodium (code generation plus review) cover the core tooling of the code review Agent stack.
