Code Review Agents: Best Practices for Static Analysis + LLM Review

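Traditional code review depends on the time and attention of senior engineers, and that capacity is physically limited: a PR typically waits 1-2 days for a review, and senior engineers spend 1-3 hours a day reviewing. LLMs have turned automated code review from a fantasy into something practical, but simply having an LLM read the diff works poorly: it produces many false positives, misses real problems, and makes suggestions that cannot be verified. This article takes an engineering view and walks through the layered architecture of a code review agent, its context design, false-positive suppression, and team collaboration models.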

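Capability Layers of a Code Review Agent

Don't think of a "code review agent" as a single model. It is a layered system in which each layer handles a different kind of check:

Layer 1: Static Analysis

ESLint / Ruff / golangci-lint
Type checking (mypy, TypeScript)
Security scanning (CodeQL, Semgrep)
Complexity checks (cyclomatic complexity)

Layer 2: Pattern Matching

Business rules (company-internal lint rules)
Historical bug patterns (pitfalls the team has hit before)
API usage conventions (team coding standards)

Layer 3: LLM Review

Design soundness
Business-logic correctness
Readability and maintainability
Test coverage

Layer 4: Context-Aware Review

Call-chain analysis (who calls this function, and what it calls)
Performance impact assessment
Security context (authentication, authorization, input validation)

Every layer has value the others cannot replace, and every layer has clear limits. Trying to make the LLM cover all of them is a common mistake: LLMs are not good at exact pattern matching (a forgotten console.log, for example), and anything a few static-analysis rules can catch does not need an LLM.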
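The layering can be sketched as a list of check functions run cheapest-first. The following is a minimal illustration, not a real tool: `Finding`, `pattern_layer`, `llm_layer`, and `run_layers` are hypothetical names, and the LLM layer is a stub that returns nothing.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    layer: str    # which layer produced the finding
    line: int     # 1-based line number in the reviewed snippet
    message: str


def pattern_layer(code: str) -> List[Finding]:
    """Layer 2 stand-in: a deterministic rule for leftover debug output."""
    rule = re.compile(r"\bconsole\.log\(")
    return [
        Finding("pattern", number, "leftover console.log")
        for number, text in enumerate(code.splitlines(), start=1)
        if rule.search(text)
    ]


def llm_layer(code: str) -> List[Finding]:
    """Layer 3 placeholder: a real system would call a model here."""
    return []


def run_layers(code: str, layers: List[Callable[[str], List[Finding]]]) -> List[Finding]:
    """Run the layers in order (cheapest first) and concatenate their findings."""
    findings: List[Finding] = []
    for layer in layers:
        findings.extend(layer(code))
    return findings


snippet = "const total = add(a, b);\nconsole.log(total);\nreturn total;"
findings = run_layers(snippet, [pattern_layer, llm_layer])
```

The forgotten `console.log` is caught by a one-line rule before any model is involved, which is the division of labor the layers are meant to enforce.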

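Comparison of Mainstream Tools

Tool	Type	Scope	LLM support	Deployment
Sourcery	AI code review	Python (in depth)	Yes (in-house)	SaaS
Qodo (formerly CodiumAI)	AI code review	Multi-language	Yes	SaaS / self-hosted
Codeball	AI review	Multi-language	Yes	SaaS
Greptile	AI code review	Multi-language	Yes	SaaS
PR-Agent (Qodo)	Open-source PR agent	Multi-language	Yes	Self-hosted
Codacy	Static analysis + AI	Multi-language	Partial	SaaS / self-hosted
SonarQube	Static analysis	Multi-language	No	Self-hosted

Sourcery is a particularly good fit for Python projects and has dedicated tuning for Pandas- and NumPy-style code. PR-Agent is open source and can be connected to any Git platform and any LLM. Codacy and SonarQube suit an incremental rollout: static analysis first, LLM review later.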

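Self-Hosting PR-Agent

PR-Agent is one of the most flexible open-source options and supports GitHub, GitLab, and Bitbucket:

# Install PR-Agent (shell)
pip install pr-agent

# Configure .pr_agent.toml
# Key names below are kept as given in this article; PR-Agent's documentation
# names the GitHub token setting github.user_token, so verify against your version.
# Keep real tokens in a secrets file or environment variable, not in a committed file.
[github]
user = "your-bot-user"
token = "ghp_xxxxx"

[config]
model = "gpt-4o"
custom_model = "openai/gpt-4o"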

[pr_reviewer]
extra_instructions = """
- Pay attention to error handling: every external call should have a try/except
- Verify that the PR includes tests for new functionality
- Flag any direct database access in business logic (use repository pattern)
"""

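Four commonly used PR-Agent commands:

/review: overall review
/describe: generate the PR description automatically
/improve: suggest code improvements
/add_docs: add docstrings automatically

# Trigger a review from Python.
# Follows the pip-package usage shown in PR-Agent's documentation; module paths
# and setting names can change between releases, so check your installed version.
# Credentials (Git provider token, model API key) must already be configured.
from pr_agent import cli
from pr_agent.config_loader import get_settings

get_settings().set("CONFIG.git_provider", "github")
pr_url = "https://github.com/myorg/myrepo/pull/123"
cli.run_command(pr_url, "/review")  # feedback is posted as PR comments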

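Context Design for LLM Review

The most common reason LLM review performs poorly is not a weak model but poorly designed context. The context for a good code review agent should include: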

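from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PRDiff:
    """Illustrative container (not from any library): diff text plus touched paths."""
    text: str
    modified_files: list = field(default_factory=list)  # repo-relative paths


class ReviewContext:
    MAX_RELATED_FILES = 5  # cap on full files included, to keep the prompt bounded

    def __init__(self, pr_diff: PRDiff, repo_path):
        self.pr_diff = pr_diff
        self.repo_path = Path(repo_path)  # accept either str or Path

    def build_context(self) -> str:
        return f"""
## PR Diff
{self.pr_diff.text}

## Related files (full content for context)
{self._load_related_files()}

## Project conventions
{self._load_style_guide()}

## Recent changes to same area
{self._load_recent_history()}

## Architectural overview
{self._load_architecture_doc()}

## Review the PR for:
1. Correctness: does the code do what it claims?
2. Edge cases: what could go wrong?
3. Test coverage: are new functions tested?
4. Style: does it match project conventions?
5. Performance: any obvious bottlenecks?
6. Security: input validation, auth checks?

## Important:
- Be specific. Reference line numbers and existing code.
- Distinguish "must fix" (bug, security) from "nice to have" (style).
- If the code is fine, say so. Don't invent issues.
"""

    def _read_if_exists(self, relative_path: str) -> str:
        path = self.repo_path / relative_path
        return path.read_text() if path.is_file() else ""

    def _load_related_files(self) -> str:
        """Load the full content of the files the PR modifies."""
        related = []
        for file in self.pr_diff.modified_files:
            full_path = self.repo_path / file
            if full_path.is_file():
                related.append(f"### {file}\n```\n{full_path.read_text()}\n```")
            if len(related) >= self.MAX_RELATED_FILES:
                break
        return "\n".join(related)

    def _load_style_guide(self) -> str:
        return self._read_if_exists("STYLE_GUIDE.md")

    def _load_recent_history(self) -> str:
        # Placeholder: for example, a summary of recent commits to the touched files
        return ""

    def _load_architecture_doc(self) -> str:
        # Assumed file name; point this at wherever the project keeps its overview
        return self._read_if_exists("ARCHITECTURE.md")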

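Key design points for the context:

The diff is the core, but the diff alone is not enough: the model must see the full code of each modified function and its callers
Load project conventions, so the LLM knows "this project uses the repository pattern, not direct SQL"
Load the architecture document, so the LLM knows the overall design
Ask for classified output ("must fix" vs "nice to have"), so the LLM does not mark every issue as a blocker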
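For Python sources, one way to give the model "the full code of each modified function" without loading whole files is to use the standard `ast` module to pull out the functions that contain changed lines. This is a sketch under that assumption; `enclosing_functions` is a hypothetical helper, and nested functions would be returned along with their outer function.

```python
import ast
from typing import List, Set


def enclosing_functions(source: str, changed_lines: Set[int]) -> List[str]:
    """Return the full source of each function that contains a changed line."""
    tree = ast.parse(source)
    selected = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
            if any(node.lineno <= line <= node.end_lineno for line in changed_lines):
                segment = ast.get_source_segment(source, node)
                if segment is not None:
                    selected.append(segment)
    return selected


example = (
    "def untouched():\n"
    "    return 1\n"
    "\n"
    "def charge(amount):\n"
    "    fee = amount * 0.03\n"
    "    return amount + fee\n"
)
# Line 5 was changed by the PR, so only charge() is pulled into the context
context = enclosing_functions(example, {5})
```

Callers still have to be found separately (for example with a code search or a call graph), which is where the context-aware layer comes in.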

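False-Positive Suppression

Noise is the biggest obstacle to adopting LLM review. When a 100-line PR comes back with 30 flagged "issues", reviewers lose patience and end up ignoring every suggestion.

Engineering strategies for reducing false positives:

1. State the unwanted behaviors explicitly in the system prompt: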

Do NOT flag:
- Stylistic preferences not enforced by linters
- Theoretical issues that cannot occur given existing code
- Issues that were already discussed and decided (e.g., the team chose a particular pattern)
- "You could also do X" without a clear reason

2. Cap the output length:

MAX_OUTPUT_TOKENS = 800  # keep the LLM from writing an essay

3. Have the LLM report a confidence score:

prompt = """
For each issue you find, give a confidence score (0-100):
- 90-100: Definitely a bug, must fix
- 70-89: Likely an issue, should fix
- 50-69: Could be an issue, discuss with author
- Below 50: Not sure, don't report
"""

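The confidence bands only help if something enforces them after the model responds. A minimal sketch, assuming the model has been asked to return a JSON array of objects with `message` and `confidence` fields (that output format and the `triage` helper are assumptions, not part of the prompt above):

```python
import json
from typing import Dict, List


def triage(raw_response: str, min_confidence: int = 50) -> Dict[str, List[dict]]:
    """Drop low-confidence issues and bucket the rest by the bands in the prompt."""
    buckets: Dict[str, List[dict]] = {"must_fix": [], "should_fix": [], "discuss": []}
    try:
        issues = json.loads(raw_response)
    except json.JSONDecodeError:
        return buckets  # unparseable output is treated as "no findings"
    if not isinstance(issues, list):
        return buckets
    for issue in issues:
        if not isinstance(issue, dict):
            continue
        score = issue.get("confidence")
        if not isinstance(score, (int, float)) or score < min_confidence:
            continue  # below the reporting threshold, or no usable score
        if score >= 90:
            buckets["must_fix"].append(issue)
        elif score >= 70:
            buckets["should_fix"].append(issue)
        else:
            buckets["discuss"].append(issue)
    return buckets


raw = json.dumps([
    {"message": "possible None dereference", "confidence": 95},
    {"message": "rename variable", "confidence": 30},
    {"message": "missing timeout on HTTP call", "confidence": 72},
])
result = triage(raw)
```

Filtering in code rather than trusting the prompt alone means a model that ignores the "don't report" instruction still cannot push low-confidence items into the PR.
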
4. Collect review feedback as training data:

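import sqlite3


class FeedbackCollector:
    def __init__(self, db: sqlite3.Connection):
        # Expects a review_feedback table with pr_id, issue_id, and action columns
        self.db = db

    def record_feedback(self, pr_id: str, issue_id: str, action: str):
        """Record how the reviewer responded to each issue."""
        # action: 'agreed', 'disagreed', 'ignored', 'fixed'
        self.db.execute(
            "INSERT INTO review_feedback (pr_id, issue_id, action) VALUES (?, ?, ?)",
            (pr_id, issue_id, action)
        )
        self.db.commit()

    def compute_precision(self, issue_type: str) -> float:
        """Compute the precision for each issue type."""
        ...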
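One possible way to fill in the per-type precision is a single aggregate query. The sketch below assumes the feedback table also stores an `issue_type` column (the INSERT above does not include one) and counts both 'agreed' and 'fixed' as accepted; both choices are assumptions, and `create_feedback_db` and `precision_by_type` are hypothetical names.

```python
import sqlite3


def create_feedback_db() -> sqlite3.Connection:
    """In-memory table for illustration; issue_type is an added column."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE review_feedback ("
        "pr_id TEXT, issue_id TEXT, issue_type TEXT, action TEXT)"
    )
    return db


def precision_by_type(db: sqlite3.Connection, issue_type: str) -> float:
    """Share of reported issues of this type that were agreed with or fixed."""
    total, accepted = db.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(action IN ('agreed', 'fixed')), 0) "
        "FROM review_feedback WHERE issue_type = ?",
        (issue_type,),
    ).fetchone()
    return accepted / total if total else 0.0


db = create_feedback_db()
rows = [
    ("pr-1", "i-1", "style", "agreed"),
    ("pr-1", "i-2", "style", "ignored"),
    ("pr-2", "i-3", "style", "disagreed"),
    ("pr-2", "i-4", "style", "fixed"),
    ("pr-3", "i-5", "security", "agreed"),
]
db.executemany("INSERT INTO review_feedback VALUES (?, ?, ?, ?)", rows)
style_precision = precision_by_type(db, "style")
```

Breaking precision down by issue type shows which categories generate the noise, so the "Do NOT flag" list can target them directly.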

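5. Review the quality of the LLM's suggestions on a regular schedule:

# Monthly roll-up; the counts come from the collected review feedback
def monthly_metrics(total: int, agreed: int, ignored: int, disagreed: int) -> dict:
    return {
        "total_issues_reported": total,
        "agreed_by_author": agreed,
        "ignored": ignored,
        "disagreed": disagreed,
        "precision": agreed / total if total else 0.0,
    }

If precision falls below 0.5, stop using LLM review or rework the prompt.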

CI/CD Integration

A code review agent should not be a standalone tool; it should be embedded in the PR workflow:

# .github/workflows/pr-agent.yml
name: PR Agent
on:
  pull_request:
    types: [opened, synchronize]

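jobs:
  review:
    runs-on: ubuntu-latest
    # The job token needs write access to post review comments on the PR
    permissions:
      contents: read
      pull-requests: write
      issues: write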
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      
      - name: Run PR-Agent
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install pr-agent
          pr-agent review --pr_url ${{ github.event.pull_request.html_url }}

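Integration modes:

Automatic review when a PR is opened: post the LLM review as a PR comment
Incremental review when a PR is updated: review only the changed lines
Local pre-commit hook: developers see LLM suggestions before they push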
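Incremental review needs to know which lines a push actually added. A minimal sketch that reads a unified diff and maps each file to its added line numbers in the new version; `added_lines` is a hypothetical helper, and renames and binary files are not handled.

```python
import re
from typing import Dict, List, Optional

# Hunk header: @@ -old_start[,old_count] +new_start[,new_count] @@
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def added_lines(diff_text: str) -> Dict[str, List[int]]:
    """Map each file in a unified diff to the new-file line numbers it adds."""
    result: Dict[str, List[int]] = {}
    current_file: Optional[str] = None
    old_left = new_left = 0  # lines still expected in the current hunk
    new_line = 0             # current line number in the new file
    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:  # inside a hunk body
            if line.startswith("+"):
                if current_file is not None:
                    result[current_file].append(new_line)
                new_line += 1
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif not line.startswith("\\"):  # skip "\ No newline at end of file"
                new_line += 1
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("+++ "):
            path = line[4:].strip()
            if path == "/dev/null":  # file deleted: nothing to review
                current_file = None
            else:
                current_file = path[2:] if path.startswith("b/") else path
                result.setdefault(current_file, [])
            continue
        match = HUNK_HEADER.match(line)
        if match:
            old_left = int(match.group(1) or 1)
            new_line = int(match.group(2))
            new_left = int(match.group(3) or 1)
    return result


diff = "\n".join([
    "diff --git a/app/billing.py b/app/billing.py",
    "--- a/app/billing.py",
    "+++ b/app/billing.py",
    "@@ -10,3 +10,4 @@ def charge(amount):",
    "     fee = amount * 0.03",
    "-    return amount",
    "+    total = amount + fee",
    "+    return total",
    "     # end",
])
changed = added_lines(diff)
```

Tracking the line counts from each hunk header, rather than matching prefixes alone, keeps an added line that happens to start with `++` from being mistaken for a file header.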

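Team Collaboration Models

A code review agent cannot work on its own; it has to assist the existing review process:

Model 1: AI first + human spot checks

Every PR goes through LLM review first
Senior engineers spot-check only the parts the LLM marks as high risk (must fix)
Claimed saving: about 70% of senior engineers' review time

Model 2: AI + double-blind human review

LLM review plus a randomly assigned human reviewer
The two review independently and the results are compared
Suited to the early stage, while the team builds trust in the AI

Model 3: AI for formatting checks only

The LLM checks only code style, naming, and documentation
Business logic and architecture are still reviewed by humans
Suited to teams that do not trust the AI

Model 4: Full AI review + human approval

The LLM produces a detailed review
The human reviewer only approves or rejects
Suited to small, fast-moving projects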
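Model 1 amounts to a routing rule over the LLM's findings. A minimal sketch: `Issue` and `route_pr` are hypothetical names, and the random audit of PRs without must-fix findings is an added assumption (a way to estimate what the LLM misses), not something the model description specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Issue:
    severity: str  # "must_fix" or "nice_to_have"
    message: str


def route_pr(issues: List[Issue], audit_draw: float, audit_rate: float = 0.1) -> str:
    """Return "senior" when a senior engineer should look at the PR, else "standard"."""
    if any(issue.severity == "must_fix" for issue in issues):
        return "senior"
    # Assumption: audit a random slice of the remaining PRs to catch misses.
    # audit_draw is a number in [0, 1), e.g. from random.random().
    if audit_draw < audit_rate:
        return "senior"
    return "standard"


flagged = [Issue("must_fix", "SQL built by string concatenation")]
clean = [Issue("nice_to_have", "rename tmp to total")]
decisions = (
    route_pr(flagged, audit_draw=0.9),
    route_pr(clean, audit_draw=0.9),
    route_pr(clean, audit_draw=0.05),
)
```

Passing the random draw in as an argument keeps the rule deterministic and easy to test.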

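Case Studies

Case 1: A SaaS company

800 PRs per month
After adopting PR-Agent, senior engineers' review time dropped from 3 hours/day to 1 hour/day
LLM review precision: 0.65 (roughly 2 of every 3 suggestions accepted)
Main false-positive types: style preferences, over-abstraction

Case 2: An open-source project

PR-Agent takes over the "newcomer-friendly review" role
The LLM gives contributors automatic feedback (naming, error handling, tests)
Maintainers spend time only on PRs the LLM marks as blockers

Case 3: Fintech

Compliance requires every PR to pass both LLM review and human review
The LLM flags sensitive code (personal data, cryptography, monetary amounts) and routes it to the compliance review queue
Manual compliance-audit cost reduced by 40%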

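Metrics

Level 1: Metrics for the LLM review itself

Number of suggestions, precision, recall
Average review time
Miss rate (of the bugs found after the fact, how many the LLM did not flag)

Level 2: Process metrics

Average number of review rounds per PR
Time to merge
Share of senior engineers' time spent on review

Level 3: Quality metrics

Number of bugs after release
Regression test coverage
Incident rate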
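The Level 1 numbers follow from two sets: what the LLM flagged, and what later turned out to be real. A minimal sketch; `review_metrics` and the issue IDs are illustrative.

```python
from typing import Dict, Set


def review_metrics(flagged: Set[str], confirmed: Set[str]) -> Dict[str, float]:
    """flagged: issue IDs the LLM reported; confirmed: issue IDs verified as real."""
    true_positives = len(flagged & confirmed)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    # Miss rate is the share of real issues the LLM never flagged: 1 - recall
    miss_rate = 1.0 - recall if confirmed else 0.0
    return {"precision": precision, "recall": recall, "miss_rate": miss_rate}


metrics = review_metrics(
    flagged={"a", "b", "c", "d"},   # four suggestions made
    confirmed={"a", "b", "e"},      # three real issues, one found only after release
)
```

Here two of four suggestions were real (precision 0.5) and two of three real issues were caught (recall about 0.67), so one in three was missed.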

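Rollout Plan

Week 1: Choose a tool (Sourcery or PR-Agent) and deploy it against a private GitHub instance.
Week 2: Pick 5 repos for a pilot and put the project conventions into the system prompt.
Week 3: Track the precision metric and collect review feedback.
Week 4: Tune the prompt (the "do not" list, project conventions).
Week 5: Expand to the whole company and establish the "AI first + human spot checks" workflow.
Week 6: Set up a monthly review, and pause usage on projects where precision is below 0.5.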

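Summary

A code review agent does not replace human reviewers. It frees them from mechanical formatting checks, style review, and basic bug detection so they can focus on architectural decisions and business correctness.

The key to adoption: start with static analysis (near-zero cost, immediate payoff), then add LLM review (faster turnaround), and finally add context-aware review (higher quality). Every step needs quantitative metrics, and usage should pause whenever precision falls short.

Reference tools: Sourcery (AI code review for Python), PR-Agent (Qodo's open-source PR agent), Qodo (formerly CodiumAI; multi-language AI code review), and AlphaCodium (code generation and review) cover the core toolchain for code review agents.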
