Unstructured

Active

GitHub · HTML · Apache-2.0

Description

Unstructured provides document parsing and cleaning capabilities, commonly used in RAG ingestion and preprocessing pipelines.

Key Features

Open-source document parsing for PDFs, HTML, Word docs, and more
Modular partitioning functions for text extraction and structure detection
Docker support with multi-platform images for x86_64 and Apple Silicon
Integration-ready for RAG ingestion and preprocessing pipelines
Supports images, tables, and complex document layouts
Installable from PyPI, with support for a local development setup

Use Cases

💡 Preprocessing unstructured documents for LLM ingestion

💡 Building RAG pipelines that need reliable document parsing

💡 Extracting text and tables from PDFs for downstream analysis

💡 Automating data preprocessing in AI/ML workflows

💡 Converting mixed document formats into structured outputs

Quick Start

1. Pull the Docker image: `docker pull downloads.unstructured.io/unstructured-io/unstructured:latest`.
2. Or install from PyPI: `pip install unstructured`.
3. Run partitioning on your documents using the `partition` function.
4. Use the structured output in your RAG or LLM pipeline.
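
Steps 3 and 4 can be sketched in Python as follows. This is a minimal illustration, not the project's recommended pipeline: it assumes the `partition` function is imported from `unstructured.partition.auto` and that each returned element exposes `.text` and `.category`. The helper names `partition_file` and `elements_to_records`, and the `min_chars` parameter, are hypothetical and exist only for this example.

```python
def partition_file(path):
    """Parse a document into a list of elements with the unstructured library."""
    # Imported lazily so the helper below also works without the package installed.
    from unstructured.partition.auto import partition

    return partition(filename=path)


def elements_to_records(elements, min_chars=1):
    """Turn parsed elements into plain dicts ready for chunking or embedding.

    Assumes each element exposes `.text` and `.category`.
    `min_chars` is a hypothetical filter added for this sketch.
    """
    records = []
    for index, element in enumerate(elements):
        text = (element.text or "").strip()
        if len(text) < min_chars:
            continue  # skip empty or near-empty fragments
        records.append({"index": index, "type": element.category, "text": text})
    return records
```

With the package installed, `elements_to_records(partition_file("example.pdf"))` would yield one dict per parsed element (`example.pdf` is a placeholder path). Keeping the element type alongside the text lets a downstream RAG step treat titles, narrative text, and tables differently.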


Related Projects

Sparrow

5.2k · Python

Active

Sparrow is a structured data extraction tool that supports instruction calling with ML, LLM, and Vision LLM for extracting structured information from documents, suitable for document parsing in RAG pipelines.

data-extraction · document-processing · llm · +3

RAGatouille

3.9k · Python

Stale

Easily use and train state-of-the-art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease of use, backed by research.

rag · python · embedding · +1