Code Review Agents: Static Analysis Plus LLM Review in Practice

A code review Agent is more than "let the LLM see the diff." This article walks through the layered architecture (static analysis, pattern matching, LLM review, context-aware review), PR-Agent self-hosting, context design, false-positive suppression, and team collaboration models, along with precision metrics and an adoption path.

AgentList · July 1, 2026
Code Review · LLM · PR-Agent · CI/CD

Traditional code review depends on senior engineers' time and attention, and that human capacity is the bottleneck: a PR often waits 1-2 days for review, and senior engineers can spend 1-3 hours per day reviewing. LLMs have made automated code review practical, but the naive "let the LLM see the diff" approach produces too many false positives, misses real issues, and makes suggestions that cannot be verified. This article takes a production-engineering look at the layered architecture, context design, false-positive suppression, and team collaboration patterns of code review Agents.

Capability Layers

Do not think of a "code review Agent" as a single model. It is a layered system where each layer handles a different class of check:

Layer 1: Static Analysis

  • ESLint, Ruff, golangci-lint
  • Type checking (mypy, TypeScript)
  • Security scanning (CodeQL, Semgrep)
  • Complexity detection (cyclomatic complexity)

Layer 2: Pattern Matching

  • Business rules (internal lint rules)
  • Historical bug patterns (bugs the team has already been bitten by)
  • API usage conventions (team coding standards)

Layer 3: LLM Review

  • Design soundness
  • Business logic correctness
  • Readability and maintainability
  • Test coverage

Layer 4: Context-aware Review

  • Call chain analysis (who calls this, what it calls)
  • Performance impact assessment
  • Security context (auth, authorization, input validation)

Each layer has irreplaceable value but also clear limits. Trying to make the LLM replace every layer is a common trap: LLMs are unreliable at exact pattern matching (for example, spotting a forgotten console.log), while a few lines of static analysis rules catch that deterministically with no LLM involved.
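As a rough illustration of that last point, a leftover console.log can be caught with a regular expression and no model call. The sketch below is a hypothetical check, not part of any tool named in this article: the function name `find_leftover_console_logs` and the sample diff are assumptions made for illustration. It looks only at lines added in a unified diff.

```python
import re

# Matches a console.log( call anywhere on a line.
CONSOLE_LOG = re.compile(r"\bconsole\.log\s*\(")


def find_leftover_console_logs(diff_text: str) -> list[str]:
    """Return added lines in a unified diff that still call console.log."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" marks the file header, not code.
        if line.startswith("+") and not line.startswith("+++"):
            if CONSOLE_LOG.search(line):
                hits.append(line[1:].strip())
    return hits


sample_diff = """\
--- a/app.js
+++ b/app.js
@@ -1,2 +1,3 @@
 function save(user) {
+  console.log("debug", user);
   return db.save(user);
"""

print(find_leftover_console_logs(sample_diff))
# → ['console.log("debug", user);']
```

In practice an existing linter rule (such as ESLint's no-console) does the same job; the point is only that this class of check needs no LLM.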

Mainstream Tool Comparison

| Tool | Type | Language coverage | LLM support | Deployment |
| --- | --- | --- | --- | --- |
| Sourcery | AI code review | Python (deep support) | Yes (in-house) | SaaS |
| Qodo (CodiumAI) | AI code review | Multi-language | Yes | SaaS / self-hosted |
| Codeball | AI review | Multi-language | Yes | SaaS |
| Greptile | AI code review | Multi-language | Yes | SaaS |
| PR-Agent (Qodo) | Open-source PR Agent | Multi-language | Yes | Self-hosted |
| Codacy | Static plus AI | Multi-language | Partial | SaaS / self-hosted |
| SonarQube | Static analysis | Multi-language | No | Self-hosted |

Sourcery suits Python projects, with specific optimizations for Pandas and NumPy style. PR-Agent is open source and integrates with any Git platform or LLM. Codacy / SonarQube suit incremental adoption: static analysis first, then LLM.

PR-Agent Self-Hosted Setup

PR-Agent is one of the more flexible open-source options, supporting GitHub, GitLab, and Bitbucket:

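pip install pr-agent

# .secrets.toml (never committed; key names follow PR-Agent's secrets
# template, so check them against the version you install)
[github]
user_token = "ghp_xxxxx"

# .pr_agent.toml (lives in the repo, so it holds no credentials)
[config]
model = "gpt-4o"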

[pr_reviewer]
extra_instructions = """
- Pay attention to error handling: every external call should have a try/except
- Verify that the PR includes tests for new functionality
- Flag any direct database access in business logic (use repository pattern)
"""

PR-Agent's most commonly used commands include:

  • /review: comprehensive review
  • /describe: auto-generate PR description
  • /improve: suggest code improvements
  • /add_docs: auto-add docstrings
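
The same commands can also be run from Python:

from pr_agent import cli
from pr_agent.config_loader import get_settings

# Entry points as shown in PR-Agent's pip-package usage docs; verify them
# against the version you install, since the API changes between releases.
get_settings().set("CONFIG.git_provider", "github")
get_settings().set("github.user_token", "ghp_xxxxx")
get_settings().set("openai.key", "sk-xxxxx")

# PR-Agent identifies a PR by its URL; feedback is posted as PR comments.
cli.run_command("https://github.com/myorg/myrepo/pull/123", "/review")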

Context Design for LLM Review

The most common reason LLM review underperforms is not "the model is not good enough" but "context design is broken." A good review Agent's context should include:

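import subprocess
from pathlib import Path


class ReviewContext:
    MAX_RELATED_FILES = 5  # bound prompt size by inlining at most this many files

    def __init__(self, pr_diff, repo_path):
        # pr_diff is assumed to render as the unified diff via str() and to
        # expose .modified_files, an iterable of repo-relative paths.
        self.pr_diff = pr_diff
        self.repo_path = Path(repo_path)

    def build_context(self) -> str:
        return f"""
## PR Diff
{self.pr_diff}

## Related files (full content for context)
{self._load_related_files()}

## Project conventions
{self._load_style_guide()}

## Recent changes to same area
{self._load_recent_history()}

## Architectural overview
{self._load_architecture_doc()}

## Review the PR for:
1. Correctness: does the code do what it claims?
2. Edge cases: what could go wrong?
3. Test coverage: are new functions tested?
4. Style: does it match project conventions?
5. Performance: any obvious bottlenecks?
6. Security: input validation, auth checks?

## Important:
- Be specific. Reference line numbers and existing code.
- Distinguish "must fix" (bug, security) from "nice to have" (style).
- If the code is fine, say so. Don't invent issues.
"""

    def _read_optional(self, name: str) -> str:
        # Missing docs are common; return an empty section rather than fail.
        path = self.repo_path / name
        return path.read_text() if path.is_file() else ""

    def _load_related_files(self) -> str:
        related = []
        for file in self.pr_diff.modified_files:
            if len(related) == self.MAX_RELATED_FILES:
                break  # stop before reading files that would be discarded
            full_path = self.repo_path / file
            if full_path.is_file():  # deleted files no longer exist on disk
                related.append(f"### {file}\n```\n{full_path.read_text()}\n```")
        return "\n".join(related)

    def _load_style_guide(self) -> str:
        return self._read_optional("STYLE_GUIDE.md")

    def _load_architecture_doc(self) -> str:
        # Hypothetical file name; point this at the project's own overview doc.
        return self._read_optional("ARCHITECTURE.md")

    def _load_recent_history(self) -> str:
        # Last 10 commits touching the modified files (requires a git checkout).
        cmd = ["git", "-C", str(self.repo_path), "log", "--oneline", "-n", "10",
               "--", *self.pr_diff.modified_files]
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout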

Key points of context design:

  • The diff is the core, but the diff alone is not enough: the LLM must also see the full code of each modified function and its callers
  • Load project conventions, so the LLM knows, for example, that "this project uses the repository pattern, not direct SQL"
  • Load architecture docs, so the LLM knows the overall design
  • Categorize output explicitly as "must fix" vs "nice to have", which keeps the LLM from marking everything as a blocker
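
To make the point about callers concrete: one inexpensive way to surface them is to walk the syntax tree of candidate files and collect call sites of the modified function. The following is a minimal sketch for Python sources only; `find_call_sites` is a hypothetical helper, not part of PR-Agent. It matches by name alone, so it will over-match unrelated functions that share a name and miss calls made dynamically.

```python
import ast


def find_call_sites(source: str, func_name: str) -> list[int]:
    """Return line numbers where func_name is called in the given source."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            target = node.func
            # Plain call: func_name(...); attribute call: obj.func_name(...)
            name = getattr(target, "id", None) or getattr(target, "attr", None)
            if name == func_name:
                lines.append(node.lineno)
    return sorted(lines)


example = """\
def charge(order):
    return order.total

def checkout(order):
    validate(order)
    return charge(order)

def retry(order, gateway):
    return gateway.charge(order)
"""

print(find_call_sites(example, "charge"))
# → [6, 9]
```

The surrounding functions at those line numbers can then be added to the prompt next to the diff, instead of whole files.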

False Positive Suppression

The "noise problem" of LLM review is the biggest barrier to adoption. A 100-line PR flagged with 30 "issues" will exhaust the reviewer's patience, and they will eventually ignore every suggestion.

Engineering strategies to reduce false positives:

1. Explicit "do NOT" instructions in the system prompt:

Do NOT flag:
- Stylistic preferences not enforced by linters
- Theoretical issues that cannot occur given existing code
- Issues that were already discussed and decided
- "You could also do X" without a clear reason

2. Cap output length:

MAX_OUTPUT_TOKENS = 800

3. Ask the LLM for a confidence score:

prompt = """
For each issue you find, give a confidence score (0-100):
- 90-100: Definitely a bug, must fix
- 70-89: Likely an issue, should fix
- 50-69: Could be an issue, discuss with author
- Below 50: Not sure, don't report
"""

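Strategies 2 and 3 can also be enforced in code after the model responds, rather than trusting the prompt alone. The sketch below assumes the LLM was asked to return a JSON list of objects with `message` and `confidence` fields; that output format, the cap `MAX_ISSUES`, and the function names are assumptions for illustration.

```python
import json

MIN_CONFIDENCE = 50  # mirrors the "Below 50: don't report" rule in the prompt
MAX_ISSUES = 5       # hypothetical cap so one PR never gets a wall of comments


def label_for(confidence: int) -> str:
    """Map a confidence score to the bands used in the prompt."""
    if confidence >= 90:
        return "must fix"
    if confidence >= 70:
        return "should fix"
    return "discuss"


def select_issues(raw_llm_output: str) -> list[dict]:
    """Drop low-confidence issues, rank the rest, and keep the top few."""
    issues = json.loads(raw_llm_output)
    kept = [i for i in issues if i.get("confidence", 0) >= MIN_CONFIDENCE]
    kept.sort(key=lambda i: i["confidence"], reverse=True)
    return [{**i, "label": label_for(i["confidence"])} for i in kept[:MAX_ISSUES]]


raw = json.dumps([
    {"message": "Unchecked None return from find_user()", "confidence": 93},
    {"message": "Consider renaming tmp", "confidence": 35},
    {"message": "Retry loop has no upper bound", "confidence": 74},
])

for issue in select_issues(raw):
    print(issue["label"], "-", issue["message"])
```

Self-reported scores are not calibrated probabilities, so the threshold is worth tuning against the feedback data collected in the next step.
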
4. Collect review feedback as training data:

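import sqlite3

class FeedbackCollector:
    def __init__(self, db_path: str = "review_feedback.db"):
        # Assumes a SQLite table review_feedback(pr_id, issue_id, issue_type, action) exists
        self.db = sqlite3.connect(db_path)

    def record_feedback(self, pr_id: str, issue_id: str, issue_type: str, action: str):
        with self.db:  # commits on success, rolls back on error
            sql = "INSERT INTO review_feedback (pr_id, issue_id, issue_type, action) VALUES (?, ?, ?, ?)"
            self.db.execute(sql, (pr_id, issue_id, issue_type, action))

    def compute_precision(self, issue_type: str) -> float:
        # Fraction of reported issues of this type that the author agreed with
        query = "SELECT COALESCE(SUM(action = 'agreed'), 0), COUNT(*) FROM review_feedback WHERE issue_type = ?"
        agreed, total = self.db.execute(query, (issue_type,)).fetchone()
        return agreed / total if total else 0.0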

5. Periodically review LLM suggestion quality:

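# Counts come from the month's feedback records; zeros are placeholders.
total = 0   # issues reported by the LLM
agreed = 0  # issues the author agreed with
monthly_metrics = {
    "total_issues_reported": total,
    "agreed_by_author": agreed,
    "ignored": 0,
    "disagreed": 0,
    # Guard against division by zero in months with no reported issues
    "precision": agreed / total if total else 0.0,
}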

If precision drops below 0.5, stop using LLM review or adjust the prompt.
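
A small gate can make that rule mechanical. The sketch below uses the 0.5 threshold from the text; the minimum sample size `MIN_SAMPLE` and the function name `review_decision` are assumptions, added because a precision figure computed from a handful of suggestions is too noisy to act on.

```python
PRECISION_FLOOR = 0.5  # threshold from the text
MIN_SAMPLE = 20        # hypothetical: below this, keep collecting feedback


def review_decision(agreed: int, ignored: int, disagreed: int) -> str:
    """Decide what to do with LLM review based on one period's feedback counts."""
    total = agreed + ignored + disagreed
    if total < MIN_SAMPLE:
        return "keep collecting"
    precision = agreed / total
    if precision >= PRECISION_FLOOR:
        return "continue"
    return "pause and retune prompt"


print(review_decision(agreed=30, ignored=10, disagreed=10))  # precision 0.6
print(review_decision(agreed=10, ignored=25, disagreed=15))  # precision 0.2
```

Counting ignored suggestions against precision is a deliberately strict choice here; a team could instead exclude them if authors often ignore comments for unrelated reasons.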

CI/CD Integration

The code review Agent should not be a standalone tool -- it should be embedded in the PR workflow:

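# .github/workflows/pr-agent.yml
name: PR Agent
on:
  pull_request:
    types: [opened, synchronize]

# Allow the job's GITHUB_TOKEN to post review comments on the PR
permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Run PR-Agent
        env:
          # The variable names PR-Agent reads can differ by version; check its docs
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_URL: ${{ github.event.pull_request.html_url }}
        run: |
          pip install pr-agent
          # Options precede the command, as in PR-Agent's documented CLI usage
          pr-agent --pr_url "$PR_URL" review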

Integration patterns:

  • Auto-review on PR open: post LLM review as a PR comment
  • Incremental review on PR update: only review modified lines
  • Local pre-commit hook: developer sees LLM suggestions before pushing
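
For the incremental pattern, the agent needs to know which lines of the new file were actually added so it can limit comments to them. A minimal sketch, assuming a single-file unified diff as input (the function name `added_line_numbers` is hypothetical):

```python
import re

# Captures the starting line number on the new-file side of a hunk header.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_line_numbers(diff_text: str) -> list[int]:
    """Return new-file line numbers of lines added in a single-file unified diff."""
    added = []
    new_lineno = None  # None until the first hunk header is seen
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            new_lineno = int(match.group(1))
        elif new_lineno is None:
            continue  # file headers before the first hunk
        elif line.startswith("+"):
            added.append(new_lineno)
            new_lineno += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removed lines and "\ No newline" markers are not in the new file
        else:
            new_lineno += 1  # context line
    return added


diff = """\
--- a/pay.py
+++ b/pay.py
@@ -10,3 +10,4 @@ def charge(order):
     total = order.total
-    fee = total * 0.03
+    fee = total * FEE_RATE
+    log.info("fee=%s", fee)
     return total + fee
"""

print(added_line_numbers(diff))
# → [11, 12]
```

A multi-file diff would need to be split per file first, since the file headers of a second file would otherwise be misread as removed and added lines.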

Team Collaboration Models

A code review Agent cannot work alone -- it must be part of the review flow:

Model 1: AI-first plus human spot-check

  • Every PR goes through LLM review first
  • Senior engineers only spot-check what the LLM flags as high-risk (must fix)
  • Can save senior engineers on the order of 70% of their review time

Model 2: AI plus double-blind human review

  • LLM review plus randomly assigned human reviewer
  • Both review independently; results are compared
  • Suits the early phase, builds team trust in AI

Model 3: AI does format checks only

  • LLM checks code style, naming, documentation
  • Business logic and architecture still human-reviewed
  • Suits teams skeptical of AI

Model 4: AI full review plus human approval

  • LLM produces a detailed review
  • Human reviewer just "approves" or "rejects"
  • Suits small projects with rapid iteration

Case Studies

Case 1: A SaaS company

  • 800 PRs per month
  • After introducing PR-Agent, senior engineer review time dropped from 3 hours/day to 1 hour/day
  • LLM review precision: 0.65 (roughly 2 of every 3 suggestions were accepted)
  • Main false positive categories: style preferences, over-abstraction

Case 2: An open-source project

  • Used PR-Agent to replace "newcomer-friendly review"
  • LLM auto-gave contributors feedback (naming, error handling, tests)
  • Maintainers only spent time on PRs the LLM marked as blockers

Case 3: A fintech

  • Compliance required all PRs to go through LLM review plus human review
  • LLM marked sensitive code (PII, encryption, amounts) into a compliance review queue
  • Reduced compliance audit manual cost by 40%

Metrics

L1: LLM review itself

  • Suggestion count, precision, recall
  • Average review time
  • Miss rate (what fraction of post-merge bugs were not flagged by the LLM)

L2: Process metrics

  • Average review rounds per PR
  • Time to merge
  • Senior engineer review time share

L3: Quality metrics

  • Post-merge bug count
  • Regression test coverage
  • Incident rate
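
The L1 ratios can be computed from four counts. The sketch below is one possible set of definitions, not a standard: precision is accepted suggestions over reported suggestions, miss rate follows the definition given above, and recall is only an estimate because issues nobody ever finds are never counted. The class and field names are assumptions, and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewCounts:
    reported: int         # suggestions the LLM posted
    accepted: int         # suggestions the author agreed with
    post_merge_bugs: int  # bugs found after merge in reviewed PRs
    unflagged_bugs: int   # of those, how many the LLM never mentioned


def ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning 0.0 when there is no data yet."""
    return numerator / denominator if denominator else 0.0


def review_metrics(c: ReviewCounts) -> dict[str, float]:
    return {
        "precision": ratio(c.accepted, c.reported),
        # Estimate: real issues caught over real issues known so far
        "recall_estimate": ratio(c.accepted, c.accepted + c.unflagged_bugs),
        "miss_rate": ratio(c.unflagged_bugs, c.post_merge_bugs),
    }


counts = ReviewCounts(reported=200, accepted=130, post_merge_bugs=10, unflagged_bugs=7)
print(review_metrics(counts))
```

Tracking these per repository, rather than company-wide, makes it easier to see which projects should be paused.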

Implementation Path

  • Week 1: Pick a tool (Sourcery or PR-Agent), deploy to a private GitHub instance.
  • Week 2: Pilot on 5 repos, encode project conventions in the system prompt.
  • Week 3: Track precision metrics, collect review feedback.
  • Week 4: Tune the prompt ("do not" list, project conventions).
  • Week 5: Roll out company-wide, build the "AI-first plus human spot-check" workflow.
  • Week 6: Establish monthly review; pause projects with precision < 0.5.

Summary

A code review Agent does not "replace human reviewers" -- it frees humans from mechanical format checks, style review, and basic bug detection so they can focus on architectural decisions and business correctness.

Adoption keys: start with static analysis (zero cost, immediate value), then add LLM review (improves speed), then add context-aware review (improves quality). Every step needs quantitative metrics; if precision falls short, pause.

Reference tools: Sourcery (AI code review for Python), PR-Agent (Qodo) (open-source PR Agent tooling), CodiumAI (multi-language AI code review), The PR-Agent (PR-Agent legacy version), and AlphaCodium (code generation plus review) cover the core tooling of the code review Agent stack.