Unstructured

Active
GitHub HTML Apache-2.0

Description

Unstructured provides document parsing and cleaning capabilities, commonly used in RAG ingestion and preprocessing pipelines.

Key Features

  • Open-source document parsing for PDFs, HTML, Word docs, and more
  • Modular partitioning functions for text extraction and structure detection
  • Docker support with multi-platform images for x86_64 and Apple Silicon
  • Integration-ready for RAG ingestion and preprocessing pipelines
  • Supports images, tables, and complex document layouts
  • Installable from PyPI, with a local development setup for contributors

Use Cases

💡 Preprocessing unstructured documents for LLM ingestion
💡 Building RAG pipelines that need reliable document parsing
💡 Extracting text and tables from PDFs for downstream analysis
💡 Automating data preprocessing in AI/ML workflows
💡 Converting mixed document formats into structured outputs
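
As an illustration of the table-extraction use case, the sketch below pulls table content out of a list of parsed elements. It is a minimal sketch, not the library's own API: `extract_tables` is a hypothetical helper, and it assumes each element exposes a `category` string, a `text` attribute, and optionally `metadata.text_as_html`. In Unstructured, the HTML rendering is assumed to be populated only when table structure inference is enabled during partitioning.

```python
from typing import Any, Iterable, List


def extract_tables(elements: Iterable[Any]) -> List[str]:
    """Return one string per table element (hypothetical helper).

    Prefers the HTML rendering in `metadata.text_as_html` when it is
    present, and falls back to the element's plain text otherwise.
    """
    tables = []
    for el in elements:
        # Assumption: table elements report the category "Table".
        if getattr(el, "category", None) != "Table":
            continue
        metadata = getattr(el, "metadata", None)
        html = getattr(metadata, "text_as_html", None)
        tables.append(html or getattr(el, "text", ""))
    return tables
```

Because the helper only reads attributes, it can be applied to the output of any partitioning call, and the resulting strings can be handed to a downstream HTML or text parser.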

Quick Start

1. Pull the Docker image: `docker pull downloads.unstructured.io/unstructured-io/unstructured:latest`.
2. Or install from PyPI: `pip install unstructured`.
3. Partition your documents with the `partition` function from `unstructured.partition.auto`, which detects the file type and returns a list of document elements.
4. Use the structured output in your RAG or LLM pipeline.
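
The PyPI route in the steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `unstructured` is installed, `example.pdf` is a hypothetical input path, and some formats (PDF among them) may need additional extras beyond the base package. `elements_to_records` and `parse_document` are illustrative helpers, not part of the library; only `partition` comes from Unstructured.

```python
from typing import Any, Dict, Iterable, List


def elements_to_records(elements: Iterable[Any]) -> List[Dict[str, str]]:
    """Flatten element-like objects into plain dicts (hypothetical helper).

    Assumes each element exposes `.category` and `.text`, as the elements
    returned by Unstructured's partition functions do.
    """
    records = []
    for el in elements:
        text = (getattr(el, "text", "") or "").strip()
        if not text:
            continue  # skip elements with no usable text
        category = str(getattr(el, "category", "Unknown"))
        records.append({"type": category, "text": text})
    return records


def parse_document(path: str) -> List[Dict[str, str]]:
    """Partition one file and return its elements as records."""
    # Imported lazily so elements_to_records works without the dependency.
    from unstructured.partition.auto import partition

    return elements_to_records(partition(filename=path))
```

A call such as `parse_document("example.pdf")` would then yield a list of `{"type": ..., "text": ...}` records that can be chunked, embedded, or passed to an LLM in a RAG pipeline.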
