Anti-Scraping and Browser Fingerprinting: Modern Web Scraping Strategies

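Ten years ago, Python requests plus BeautifulSoup could handle roughly 90% of scraping tasks. By 2025 the Web has moved from "static HTML" to an "anti-bot battlefield": professional anti-bot services such as Cloudflare, Akamai, and DataDome are reportedly deployed on more than 70% of top sites, and a bare HTTP request is quickly flagged as a bot. This article takes an engineering perspective and walks through modern anti-bot countermeasures, fingerprint spoofing techniques, legal and compliance boundaries, and tool selection.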

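The two lines of defense in modern anti-bot systems

Layer 1: the HTTP layer (passive detection)

User-Agent string
HTTP header completeness (Accept-Language, Accept-Encoding)
TLS fingerprint (JA3/JA4)
IP reputation (datacenter IP vs. residential IP)
Request rate and patterns

Layer 2: the browser layer (active verification)

JavaScript challenges (proof-of-work computation, executing obfuscated code)
CAPTCHA (image recognition, reCAPTCHA, hCaptcha)
Behavioral analysis (mouse trajectories, typing rhythm)
Canvas / WebGL fingerprints
Automation framework detection (navigator.webdriver, window.chrome.runtime)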

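The days when "just add a User-Agent" was enough are over. Even a real headless browser, left at its default configuration, can be flagged as a bot in well under 30 seconds.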
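To make the passive signals listed above concrete, here is a minimal sketch of how a server might combine them into a suspicion score. The `RequestInfo` type, the `passive_bot_score` function, and all weights and thresholds are hypothetical and chosen only for illustration; real anti-bot services use far richer models.

```python
from dataclasses import dataclass

# Headers a mainstream browser normally sends on a page navigation.
EXPECTED_HEADERS = ("user-agent", "accept", "accept-language", "accept-encoding")


@dataclass
class RequestInfo:
    """Hypothetical summary of one incoming request, as the server sees it."""
    headers: dict
    is_datacenter_ip: bool = False
    requests_last_minute: int = 0


def passive_bot_score(req: RequestInfo) -> int:
    """Return a toy suspicion score; higher means more bot-like."""
    lowered = {name.lower(): value for name, value in req.headers.items()}
    # Each missing browser header adds one point.
    score = sum(1 for name in EXPECTED_HEADERS if name not in lowered)
    # Default library User-Agents are an obvious giveaway.
    user_agent = lowered.get("user-agent", "").lower()
    if "python-requests" in user_agent or "curl/" in user_agent:
        score += 3
    # Datacenter IPs and high request rates are weaker, additive signals.
    if req.is_datacenter_ip:
        score += 2
    if req.requests_last_minute > 60:
        score += 2
    return score
```

A request with only a library User-Agent, sent from a datacenter IP at a high rate, scores much higher than one carrying a full browser header set, which is why the HTTP-layer measures in the next section address headers, TLS, and IP together rather than one at a time.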

Anti-detection at the HTTP layer

import itertools

from curl_cffi import requests as cffi_requests

# 1. User-Agent and header spoofing: send the header set a real Chrome would send
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# 2. TLS fingerprint spoofing (curl_cffi)
response = cffi_requests.get(
    "https://example.com",
    headers=HEADERS,
    impersonate="chrome120",  # mimic Chrome 120's TLS fingerprint
)

# 3. IP rotation
proxies = [
    "http://user:pass@residential-proxy-1.com:8000",
    "http://user:pass@residential-proxy-2.com:8000",
]
proxy_pool = itertools.cycle(proxies)

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder targets
for url in urls:
    proxy = next(proxy_pool)
    # Keep the same headers and TLS profile on every proxied request
    response = cffi_requests.get(
        url,
        headers=HEADERS,
        impersonate="chrome120",
        proxies={"http": proxy, "https": proxy},
    )

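curl_cffi is built on curl-impersonate, a patched libcurl that reproduces a real browser's TLS handshake fingerprint (JA3/JA4), which makes it a key tool for evading HTTP-layer detection. Setting impersonate="chrome120" makes the TLS fingerprint closely match that of Chrome 120.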
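To show what a JA3 fingerprint actually is, the sketch below builds the JA3 string from ClientHello fields and hashes it with MD5, following the published JA3 scheme (five comma-separated fields, values joined by dashes, GREASE values ignored). The function names and the sample values in the usage note are illustrative, not taken from a real handshake.

```python
import hashlib


def is_grease(value: int) -> bool:
    """GREASE placeholders look like 0x0a0a, 0x1a1a, ..., 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(tls_version, ciphers, extensions, curves, point_formats) -> str:
    """Join the five ClientHello fields in JA3 order, dropping GREASE values."""
    groups = (ciphers, extensions, curves, point_formats)
    parts = [str(tls_version)]
    for group in groups:
        parts.append("-".join(str(v) for v in group if not is_grease(v)))
    return ",".join(parts)


def ja3_hash(ja3: str) -> str:
    """The JA3 fingerprint is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()
```

For example, `ja3_string(771, [0x0A0A, 4865, 4866], [0, 23], [29, 23], [0])` yields `771,4865-4866,0-23,29-23,0`. Because the order of ciphers and extensions is part of the string, a client whose TLS library offers them in a different order than Chrome produces a different hash even when its HTTP headers are identical.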

The browser layer: Playwright + Stealth

from playwright.sync_api import sync_playwright

# playwright-stealth hides automation traces
# (stealth_sync is the 1.x API; newer releases expose a different entry point)
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    )
    page = context.new_page()

    # Apply the stealth patches before navigating
    stealth_sync(page)

    page.goto("https://example.com")

    # Collect the values that bot detectors commonly inspect
    fingerprint = page.evaluate("""
        () => ({
            webdriver: navigator.webdriver,
            languages: navigator.languages,
            plugins: navigator.plugins.length,
            chrome: !!window.chrome,
        })
    """)
    print(fingerprint)
    # Expected, roughly: webdriver None or False, languages ['en-US', 'en'],
    # plugins greater than 0, chrome True

    page.screenshot(path="result.png", full_page=True)
    browser.close()

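playwright-stealth patches a range of common automation detection points (the exact set depends on the installed version), for example:

navigator.webdriver hidden (undefined or false)
navigator.languages = ["en-US", "en"]
navigator.plugins populated with realistic plugin entries
window.chrome.runtime injected
WebGL vendor and renderer strings overridden
Canvas fingerprint noise (typically needs separate tooling rather than playwright-stealth itself)
and so on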
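The navigator report collected with page.evaluate in the example above can be checked automatically instead of being read by eye. The following is a minimal sketch; the function name `suspicious_signals` and the specific rules are assumptions for illustration and cover only the four fields gathered in that example.

```python
def suspicious_signals(report: dict) -> list:
    """List the fields in a navigator report that still look automated."""
    problems = []
    if report.get("webdriver"):        # True means automation is exposed
        problems.append("webdriver")
    if not report.get("languages"):    # empty or missing language list
        problems.append("languages")
    if not report.get("plugins"):      # zero plugins is unusual for desktop Chrome
        problems.append("plugins")
    if not report.get("chrome"):       # window.chrome is missing
        problems.append("chrome")
    return problems
```

An empty list means none of these four checks fired. It does not mean the browser is undetectable, since real detectors inspect many more properties than these.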

Advanced anti-detection: simulating human behavior

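import random
import time

def human_like_delay():
    # Normally distributed wait, clamped so time.sleep never receives a negative value
    time.sleep(max(0.2, random.gauss(1.5, 0.5)))

def human_like_mouse_move(page, target_x, target_y, start_x=0, start_y=0):
    """Move the mouse along a jittered quadratic Bezier curve."""
    # Playwright does not expose the current pointer position,
    # so the caller supplies the starting point
    steps = random.randint(20, 40)
    # One random control point, pushed off the straight line, bends the path
    control_x = start_x + (target_x - start_x) * random.uniform(0.3, 0.7) + random.uniform(-50, 50)
    control_y = start_y + (target_y - start_y) * random.uniform(0.3, 0.7) + random.uniform(-50, 50)
    for i in range(1, steps + 1):
        t = i / steps
        # Quadratic Bezier: B(t) = (1-t)^2 * P0 + 2(1-t)t * P1 + t^2 * P2
        x = (1 - t) ** 2 * start_x + 2 * (1 - t) * t * control_x + t ** 2 * target_x
        y = (1 - t) ** 2 * start_y + 2 * (1 - t) * t * control_y + t ** 2 * target_y
        if i < steps:
            # Small jitter on intermediate points; the last point lands on the target
            x += random.gauss(0, 2)
            y += random.gauss(0, 2)
        page.mouse.move(x, y)
        time.sleep(random.uniform(0.005, 0.02))

def human_like_typing(page, selector, text):
    """Type at a human-like, uneven speed."""
    page.click(selector)
    for char in text:
        page.keyboard.type(char, delay=random.randint(50, 200))
        # Occasional pause ("thinking")
        if random.random() < 0.1:
            time.sleep(random.uniform(0.3, 1.0))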

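Key behavioral signals that anti-automation systems look at:

The mouse path is not a straight line
Scrolling accelerates and decelerates (not constant speed)
Typing speed varies
Occasional pauses, and typos that get corrected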
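The scrolling signal in the list above can be sketched as follows: a total scroll distance is split into per-step deltas using an ease-in-out curve, so movement starts slowly, speeds up, and slows down again. The function name `eased_scroll_deltas` and the choice of the smoothstep curve are assumptions made for illustration.

```python
def eased_scroll_deltas(total_px: float, steps: int = 20) -> list:
    """Split a scroll distance into per-step deltas with ease-in-out pacing."""

    def smoothstep(t: float) -> float:
        # Classic ease-in-out curve: 0 at t=0, 1 at t=1, flat at both ends
        return t * t * (3 - 2 * t)

    # Scroll position after each step, then the difference between neighbours
    positions = [total_px * smoothstep(i / steps) for i in range(steps + 1)]
    return [after - before for before, after in zip(positions, positions[1:])]
```

With Playwright, each delta could be passed to `page.mouse.wheel(0, delta)` with a short randomized sleep between calls. The deltas are small at the start and end and largest in the middle, and they sum to the requested distance.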

Bypassing Cloudflare

Cloudflare is one of the most widely deployed anti-bot services. There are several ways to get past it:

Method 1: cloudscraper (for simple sites)

import cloudscraper

scraper = cloudscraper.create_scraper()
response = scraper.get("https://example.com")

Method 2: FlareSolverr (self-hosted, open source)

docker run -d --name flaresolverr -p 8191:8191 ghcr.io/flaresolverr/flaresolverr:latest

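import requests

# Ask FlareSolverr to solve the challenge and return the Cloudflare clearance cookies
response = requests.post("http://localhost:8191/v1", json={
    "cmd": "request.get",
    "url": "https://example.com",
    "maxTimeout": 60000,
})
solution = response.json()["solution"]

# Reuse the cookies for further requests. The clearance cookie is tied to the
# User-Agent (and IP) that solved the challenge, so send the same User-Agent.
session = requests.Session()
session.headers["User-Agent"] = solution["userAgent"]
for cookie in solution["cookies"]:
    session.cookies.set(cookie["name"], cookie["value"])
response = session.get("https://example.com")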

Method 3: undetected-chromedriver

import undetected_chromedriver as uc

driver = uc.Chrome(headless=True)
driver.get("https://example.com")

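Method 4: commercial API services

Service	Indicative price	Notes
Bright Data	$0.5-1/GB	Industry standard
ScraperAPI	$0.0005/request	Cheap
Oxylabs	$0.5-2/GB	High quality
Zyte (formerly Scrapinghub)	Enterprise pricing	Full-stack solution
Browserless	$0.05/hour	Dedicated browser API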

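Tool selection

Tool	Strengths	Weaknesses	Best for
requests + BS4	Simple	No JS rendering	Static sites
curl_cffi	TLS fingerprint spoofing	Can still be stopped by JS challenges	Light anti-bot protection
Playwright + stealth	Full browser	Slow, expensive	Moderate anti-bot protection
Scrapling	Self-healing selectors	New project, sparse documentation	Moderate anti-bot protection
Steel Browser	Cloud stealth browser	Paid	High anonymity
Browserless	Cloud-hosted	Paid	High anonymity
Pydoll	CDP without WebDriver	New	Evading fingerprint detection
FlareSolverr	Cloudflare-specific	Self-hosted	Cloudflare-protected sites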

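Scrapling's killer feature is "self-healing selectors": when a site is redesigned and a selector stops matching, it tries to relocate the element automatically instead of failing outright.

Pydoll, which appeared in 2024, is a "WebDriver-free" scraper: it drives a Chromium browser directly over CDP (Chrome DevTools Protocol) with no WebDriver in between, which avoids the usual navigator.webdriver-style detection signals.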

Residential IPs and proxy rotation

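import random

import requests

url = "https://example.com"  # placeholder target

# Smartproxy configuration (rotating gateway)
PROXY = "http://user:pass@gate.smartproxy.com:7000"

# Each request goes out through a different IP
response = requests.get(
    url,
    proxies={"http": PROXY, "https": PROXY},
)

# Sticky session (same IP for about 10 minutes)
# Assumption: the provider reads the session ID from the proxy username;
# the exact syntax varies by provider, so check its documentation
session_id = random.randint(1, 10000)
STICKY_PROXY = f"http://user-session-{session_id}:pass@gate.smartproxy.com:7000"
response = requests.get(
    url,
    proxies={"http": STICKY_PROXY, "https": STICKY_PROXY},
)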

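Proxy types compared:

Type	Cost	Speed	Risk	Best for
Datacenter	Low	Fast	High (easily identified)	General crawling
Residential	Medium	Medium	Medium	Moderate anti-bot protection
Mobile	High	Slow	Low	High anonymity
ISP	High	Fast	Low	High anonymity + high speed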
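Whatever proxy type is chosen, rotation works better when proxies that keep failing are taken out of circulation. The class below is a minimal sketch of that idea; `ProxyPool`, its method names, and the default of three consecutive failures are hypothetical choices, not part of any proxy provider's API.

```python
class ProxyPool:
    """Round-robin proxy pool that evicts proxies after repeated failures."""

    def __init__(self, proxies, max_failures: int = 3):
        self._healthy = list(proxies)
        self._failures = {proxy: 0 for proxy in proxies}
        self._index = 0
        self.max_failures = max_failures

    def get(self) -> str:
        """Return the next healthy proxy in rotation."""
        if not self._healthy:
            raise RuntimeError("no healthy proxies left")
        proxy = self._healthy[self._index % len(self._healthy)]
        self._index += 1
        return proxy

    def report_failure(self, proxy: str) -> None:
        """Count a failure; evict the proxy once it reaches the limit."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self.max_failures and proxy in self._healthy:
            self._healthy.remove(proxy)

    def report_success(self, proxy: str) -> None:
        """A successful request resets the consecutive-failure counter."""
        if proxy in self._failures:
            self._failures[proxy] = 0
```

A caller would wrap each request: call `get()`, send the request through that proxy, then call `report_success` or `report_failure` depending on the outcome. Evicted proxies are never retried in this sketch; a production pool would usually re-admit them after a cooldown.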

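Legal and compliance boundaries

Rules you must follow:

robots.txt: read it before crawling, and stay out of disallowed directories
CFAA (US) / GDPR (EU) / China's Data Security Law: collect only public data, and do not bypass authentication
Terms of service: many sites' ToS prohibit automated scraping
Data use: use scraped data only for lawful purposes; do not resell it or use it to discriminate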
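The robots.txt rule is easy to enforce in code with Python's standard-library `urllib.robotparser`. The sketch below parses robots.txt content from a string so it runs offline; the sample rules and the user-agent name `my-crawler` are made up for illustration. In a real crawler the parser would be pointed at the site's actual robots.txt via `set_url` and `read`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the example
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Check each URL before fetching it
allowed_public = parser.can_fetch("my-crawler", "https://example.com/public/page")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data")

# Honor the requested delay between requests, if the site declares one
delay_seconds = parser.crawl_delay("my-crawler")
```

Here the public URL is allowed, the URL under /private/ is not, and the declared crawl delay is 5 seconds, which the crawler should treat as a minimum interval between requests.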

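Red lines:

❌ Bypassing authentication to scrape private data
❌ Scraping and storing PII (personally identifiable information)
❌ Deliberately circumventing CAPTCHAs
❌ Scraping so aggressively that the target site's service degrades
✅ Scraping public data, honoring robots.txt, and throttling request rates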

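Failure modes and how to handle them

Failure mode	Symptom	Response
IP blocked	HTTP 429 errors	Switch IP / lower the request rate
JS challenge	403 plus a verification page	Use FlareSolverr
Fingerprint detected	CAPTCHA returned	Use undetected-chromedriver
Behavior detected	Failure after form submission	Add more realistic human-behavior simulation
Site redesign	Selectors stop matching	Use Scrapling's self-healing selectors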
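Part of the table above can be turned into a small dispatcher that inspects a response and picks the next step. This is a rough sketch: the function `classify_failure`, the category names, and the text markers it searches for (such as "just a moment" and "captcha") are assumptions, and real challenge pages vary by vendor and over time.

```python
def classify_failure(status_code: int, body: str) -> str:
    """Map an HTTP response to one of the failure modes in the table."""
    text = body.lower()
    if status_code == 429:
        return "ip_blocked"
    if "captcha" in text:
        return "fingerprint_detected"
    if status_code == 403 and ("just a moment" in text or "challenge" in text):
        return "js_challenge"
    if status_code == 200:
        return "ok"
    return "unknown"


# Suggested next step per failure mode, mirroring the table
ACTIONS = {
    "ip_blocked": "switch IP or lower the request rate",
    "js_challenge": "route the request through FlareSolverr",
    "fingerprint_detected": "retry with undetected-chromedriver",
    "ok": "parse the page",
    "unknown": "log the response and inspect it manually",
}
```

Behavior detection and site redesigns do not show up in a single status code, so they are left out here; they are usually caught by monitoring form-submission results and selector match rates.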

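Implementation roadmap

Week 1: pick the target sites, read robots.txt, and confirm that scraping them is compliant.
Week 2: scrape static content with requests + curl_cffi.
Week 3: if the content needs JS rendering, move up to Playwright + stealth.
Week 4: if you hit Cloudflare, add FlareSolverr or switch to a commercial API.
Week 5: build an IP pool plus rate limiting, and monitor the block rate.
Week 6: build the data pipeline, long-term storage, and anomaly alerting.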
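Monitoring the block rate, as in week 5, can start with a sliding window over recent response codes. The sketch below is one possible shape; the class name `BlockRateMonitor`, the window size, the 20% threshold, and the choice of 403 and 429 as "blocked" codes are all assumptions to tune per target site.

```python
from collections import deque


class BlockRateMonitor:
    """Track the share of blocked responses over the most recent requests."""

    BLOCKED_CODES = (403, 429)

    def __init__(self, window: int = 100, threshold: float = 0.2):
        # deque with maxlen drops the oldest entry automatically
        self._results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status_code: int) -> None:
        """Store whether this response counted as blocked."""
        self._results.append(status_code in self.BLOCKED_CODES)

    @property
    def block_rate(self) -> float:
        """Fraction of blocked responses in the current window."""
        if not self._results:
            return 0.0
        return sum(self._results) / len(self._results)

    def should_back_off(self) -> bool:
        """True when the block rate exceeds the configured threshold."""
        return self.block_rate > self.threshold
```

The crawler calls `record` after every response and checks `should_back_off` before scheduling the next batch; when it returns True, slowing down or rotating IPs is usually cheaper than pushing through and getting the whole pool banned.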

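Summary

Modern Web scraping has evolved from a "programmer's practice project" into "professional anti-bot engineering". At the HTTP layer, curl_cffi mimics TLS fingerprints; at the browser layer, Playwright + stealth hides automation traces; at the behavior layer, human-behavior simulation does the work; and against cloud anti-bot services, the options are a commercial API or a self-hosted FlareSolverr.

But always remember that compliance is the baseline. Bypass techniques are only tools, and only lawful use keeps you clear of legal liability. Before scraping, read robots.txt, check the ToS, throttle your request rate, and protect user privacy.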

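Reference tools: Scrapling (a modern scraping library with self-healing selectors), Steel Browser (a cloud stealth browser), Browserless (a cloud browser API), and Pydoll (a WebDriver-free CDP scraper) cover the core nodes of the anti-bot toolchain.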
