Zerox

Stale

GitHub · TypeScript · MIT

Description

An OCR and document extraction tool that uses vision models to convert PDFs and images into structured text.

Related Projects

Crawlee

23.6k · TypeScript

Active

A web scraping and browser automation library for Node.js to build reliable crawlers, supporting Puppeteer, Playwright, Cheerio, and raw HTTP. Extract data for AI, LLMs, RAG, or GPTs with proxy rotation and both headful and headless modes.

typescript · javascript · data-processing · +3

MinerU

66.2k · Python

Active

Transforms complex documents such as PDFs into LLM-ready Markdown or JSON for agentic workflows, with support for layout analysis, formula recognition, and table extraction.

data-processing · rag · python · +2