Crawlee Python

Active
GitHub Python Apache-2.0

Description

Crawlee for Python is Apify's web scraping and browser automation library, designed for reliable data collection over plain HTTP or through a headful or headless browser.

Key Features

  • Unified API across HTTP-based crawlers and Playwright-based headless-browser crawlers
  • Automatic request queuing, retries, throttling, and proxy rotation
  • Pluggable HTTP clients: httpx, curl-impersonate, and raw socket
  • Browser fingerprint management and stealth mode to bypass anti-bot defenses
  • Dataset and Key-Value Store integrations for structured storage of crawl results
  • Native interoperability with the Apify platform for deploying crawlers to the cloud
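
To make the "request queuing and retries" feature concrete, here is a minimal standard-library sketch of the general idea: URLs wait in a FIFO queue, and a failed request is put back at the end of the queue until its retry budget runs out. This is not Crawlee's implementation or API. The names `crawl_with_retries` and `make_demo_fetch` and the `.test` URLs are hypothetical, and throttling and proxy rotation are left out.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable

Fetch = Callable[[str], Awaitable[str]]


# Hypothetical helper for illustration only; not part of Crawlee's API.
async def crawl_with_retries(
    urls: list[str], fetch: Fetch, max_retries: int = 2
) -> tuple[dict[str, str], list[str]]:
    """Process URLs from a FIFO queue, re-enqueueing failures up to max_retries."""
    queue = deque((url, 0) for url in urls)  # (url, retries used so far)
    results: dict[str, str] = {}
    failed: list[str] = []
    while queue:
        url, retries = queue.popleft()
        try:
            results[url] = await fetch(url)
        except Exception:
            if retries < max_retries:
                queue.append((url, retries + 1))  # retry behind pending work
            else:
                failed.append(url)  # retry budget spent, give up
    return results, failed


def make_demo_fetch() -> Fetch:
    """Build a fake fetcher: 'flaky' URLs fail once, 'down' URLs always fail."""
    calls: dict[str, int] = {}

    async def fetch(url: str) -> str:
        calls[url] = calls.get(url, 0) + 1
        if "down" in url or ("flaky" in url and calls[url] == 1):
            raise ConnectionError(url)
        return f"<html>{url}</html>"

    return fetch


if __name__ == "__main__":
    demo_urls = ["https://ok.test", "https://flaky.test", "https://down.test"]
    results, failed = asyncio.run(crawl_with_retries(demo_urls, make_demo_fetch()))
    print(sorted(results), failed)
```

In this sketch the flaky URL succeeds on its second attempt, while the URL that always fails ends up in the failed list after two retries.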

Use Cases

💡 Building production web crawlers for e-commerce price monitoring
💡 Scraping JavaScript-rendered pages that require a real browser
💡 Feeding structured web data into RAG pipelines and downstream LLM agents
💡 Authoring reliable long-running crawlers with built-in retries and proxy management
💡 Migrating Node.js Crawlee projects to Python while keeping the same conceptual model

Quick Start

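pip install 'crawlee[playwright]'
playwright install

import asyncio
# Import path as given here; newer Crawlee releases may expose these from crawlee.crawlers instead.
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()
    # The crawler has already navigated to context.request.url when the handler runs.
    @crawler.router.default_handler
    async def handle(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data({"url": context.request.url, "title": title})

    await crawler.run(["https://example.com"])

asyncio.run(main())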
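
The decorator-based handler registration used in the Quick Start can be pictured with a small standard-library sketch. `MiniRouter`, its `dispatch` method, and the dictionary-shaped context are hypothetical stand-ins for illustration, not Crawlee's classes. The sketch only shows the general pattern of storing a handler through a decorator and calling it once per request.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

Handler = Callable[[dict[str, Any]], Awaitable[None]]


class MiniRouter:
    """Hypothetical stand-in for a router; not Crawlee's implementation."""

    def __init__(self) -> None:
        self._default: Optional[Handler] = None

    def default_handler(self, func: Handler) -> Handler:
        # Used as a decorator: remember the function and return it unchanged.
        self._default = func
        return func

    async def dispatch(self, context: dict[str, Any]) -> None:
        # Call the registered handler with the per-request context.
        if self._default is None:
            raise RuntimeError("no default handler registered")
        await self._default(context)


router = MiniRouter()
collected: list[dict[str, str]] = []


@router.default_handler
async def handle(context: dict[str, Any]) -> None:
    # Stands in for push_data: record one structured result per request.
    collected.append({"url": context["url"]})


asyncio.run(router.dispatch({"url": "https://example.com"}))
```

Because the decorator returns the function unchanged, `handle` stays an ordinary coroutine that can also be called directly in tests.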

Related Projects