Presidio

Active
GitHub Python MIT

Description

Microsoft's open-source context-aware PII detection and de-identification SDK for text, images, and structured data, providing sensitive data protection for LLM applications and agents.

Key Features

  • Context-aware PII detection — Identifies credit card numbers, names, addresses, and other sensitive entities using NER, regex, rule logic, and checksums
  • Multiple de-identification modes — Supports masking, replacement, encryption, pseudonymization, and other anonymization strategies
  • Image PII redaction — Built-in image text recognition and PII region masking, with DICOM medical image support
  • Custom recognizers — Extend PII detection with custom recognizers and integrate external NLP models
  • Multi-language support — Built-in PII detection across multiple languages for global data compliance
  • Flexible deployment — Supports Python, PySpark, Docker, and Kubernetes deployment options
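As a rough illustration of how regex, checksum, and context signals can combine in the detection described above, here is a minimal standard-library sketch. It is not Presidio's implementation: the names `CARD_PATTERN`, `CONTEXT_WORDS`, `luhn_valid`, and `find_cards`, as well as the score values, are hypothetical and chosen only for this example.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

# Hypothetical context words that raise confidence when they appear before a match.
CONTEXT_WORDS = ("card", "credit", "visa", "mastercard")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text, base_score=0.5, context_boost=0.35, window=30):
    """Return (start, end, score) for each checksum-valid card-like match."""
    findings = []
    for m in CARD_PATTERN.finditer(text):
        if not luhn_valid(m.group()):
            continue  # the regex matched, but the checksum rules it out
        before = text[max(0, m.start() - window):m.start()].lower()
        has_context = any(word in before for word in CONTEXT_WORDS)
        score = base_score + (context_boost if has_context else 0.0)
        findings.append((m.start(), m.end(), round(score, 2)))
    return findings


if __name__ == "__main__":
    print(find_cards("My credit card is 4111 1111 1111 1111."))
    print(find_cards("Order 1234 5678 9012 3456"))
```

The checksum filters out digit runs that merely look like card numbers, and nearby context words raise the score of the matches that remain, so a downstream threshold can separate likely hits from weak ones.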

Use Cases

💡 Detect and redact PII in user input and model output before and after LLM calls
💡 Scan and anonymize sensitive information in RAG knowledge base documents
💡 Process personally identifiable information in customer support tickets and chat logs
💡 Redact patient information from medical images and documents
💡 Support compliance with GDPR, HIPAA, CCPA, and other data protection regulations

Quick Start

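Install the analyzer, the anonymizer, and the spaCy English model:

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Then detect and anonymize PII in Python:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "John's email is john@example.com, call him at 555-123-4567."

# Detect PII entities (names, emails, phone numbers, ...) in English text
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language='en')

# Replace each detected span using the default anonymization operator
anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=results).text)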

Related Projects