HolmesGPT

Active
GitHub Python Apache-2.0

Description

A CNCF Sandbox SRE agent that analyzes infrastructure logs and metrics to help diagnose incidents and support day-to-day operations.

Key Features

  • Agentic loop for querying live observability data and identifying root causes
  • Deep integrations with Prometheus, Grafana, Datadog, Kubernetes, and more
  • Operator mode for 24/7 background monitoring with Slack alerts and auto-PRs
  • Bidirectional alert integration with Alertmanager, PagerDuty, Opsgenie, and Jira
  • Petabyte-scale data handling with server-side filtering and memory-safe execution
  • Support for any LLM provider, including OpenAI, Anthropic, Azure, Bedrock, and Gemini
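
The "agentic loop" in the first feature can be pictured with a small, generic sketch. This is not HolmesGPT's implementation: the names investigate, scripted_planner, and fake_logs are hypothetical, and the planner stands in for an LLM call. It only illustrates the general pattern of alternating between choosing a tool, reading its output, and deciding whether there is enough evidence to answer.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only (not HolmesGPT APIs).
Tool = Callable[[str], str]
# A planner returns ("call", tool_name, query) or ("answer", text, "").
Planner = Callable[[str, List[str]], Tuple[str, str, str]]


def investigate(question: str, tools: Dict[str, Tool], plan: Planner,
                max_steps: int = 5) -> str:
    """Alternate between planning and tool calls until the planner answers."""
    observations: List[str] = []
    for _ in range(max_steps):
        kind, name_or_answer, query = plan(question, observations)
        if kind == "answer":
            return name_or_answer
        tool = tools.get(name_or_answer)
        if tool is None:
            # Feed the mistake back so the planner can pick another tool.
            observations.append(f"unknown tool: {name_or_answer}")
            continue
        observations.append(tool(query))
    return "no conclusion within step budget"


def fake_logs(query: str) -> str:
    """Stand-in for a real log query; returns canned text."""
    return "OOMKilled: container exceeded memory limit"


def scripted_planner(question: str, observations: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM: fetch logs once, then answer from what was seen."""
    if not observations:
        return ("call", "logs", "pod=checkout")
    return ("answer", "Likely root cause: " + observations[-1], "")


if __name__ == "__main__":
    print(investigate("Why is checkout failing?", {"logs": fake_logs}, scripted_planner))
```

The step budget (max_steps) is a design choice in this sketch: it bounds cost when the planner never reaches a conclusion.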

Use Cases

💡 Automated root cause analysis for production incidents across Kubernetes, VMs, and cloud
💡 Continuous health checking of microservices with automatic regression detection
💡 Post-deployment verification to ensure new releases are healthy
💡 Incident triage by correlating alerts from multiple monitoring platforms
💡 Automated remediation via GitHub PRs based on identified root causes

Quick Start

1. Install: pip install holmesgpt
2. Configure data sources in config.yaml (Kubernetes, Prometheus, etc.)
3. Set your LLM API key (e.g. OPENAI_API_KEY)
4. Run: holmes investigate "Why is my service unhealthy?"
5. Holmes will query connected data sources and report root causes
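
Steps 3 and 4 above can be wrapped in a small preflight script. The sketch below is an illustration, not part of HolmesGPT: the helper names missing_env and build_command are hypothetical, and the command line is copied from step 4 as written, so confirm it against holmes --help on your installation.

```python
import os
import shlex

# Step 3: the variable named in the Quick Start. Swap in the variable your
# LLM provider expects if you are not using OpenAI.
REQUIRED_ENV = ("OPENAI_API_KEY",)


def missing_env(env=os.environ, required=REQUIRED_ENV):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


def build_command(question):
    """Build the argument list for step 4 (assumed CLI shape, taken from the steps above)."""
    return ["holmes", "investigate", question]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        print("Set these variables first:", ", ".join(missing))
    else:
        # Printed rather than executed so the sketch has no side effects.
        print(shlex.join(build_command("Why is my service unhealthy?")))
```

To actually run the investigation, the list returned by build_command can be passed to subprocess.run(..., check=True); passing a list rather than a string avoids shell quoting problems in the question text.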

Related Projects