Server data from the Official MCP Registry
Kubernetes RCA agent that sandboxes its own AI with OPA Gatekeeper. MCP-ready for any client.
Valid MCP server (2 strong, 4 medium validity signals). 4 known CVEs in dependencies. Imported from the Official MCP Registry.
5 files analyzed · 4 issues found
Security scores are indicators to help you make informed decisions, not guarantees. Always review permissions before connecting any MCP server.
This plugin requests these system permissions. Most are normal for its category.
Set these up before or after installing:
Environment variable: AI_API_KEY
Environment variable: AI_MODEL
Environment variable: KUBECONFIG
From the project's GitHub README.
An Autonomous AI-Driven Root Cause Analysis Agent for Kubernetes
K8gentS is a continuously running observer inside your Kubernetes cluster. It watches the event stream for all Warning-class events — pod failures, node pressure, storage mount errors, config problems — and routes them through a Gemini-powered reasoning engine to produce root cause analysis and resolution steps. Its goal is to reduce MTTR and surface failure context that would otherwise require manual kubectl investigation.
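The watch-and-route loop described above can be sketched roughly as follows. This is an illustrative stdlib-only sketch, not the project's code: `route_warnings` and the dict-shaped events are stand-ins for a real Kubernetes watch stream and the agent's internal types.

```python
# Minimal sketch of the observer loop: filter Warning-class events and hand
# them to an analysis callback. `events` stands in for a Kubernetes watch
# stream; the real agent uses the Kubernetes API, not plain dicts.
def route_warnings(events, analyze):
    findings = []
    for event in events:
        if event.get("type") != "Warning":
            continue                      # Normal events are ignored
        findings.append(analyze(event))   # e.g. enqueue for LLM reasoning
    return findings

sample = [
    {"type": "Normal", "reason": "Scheduled"},
    {"type": "Warning", "reason": "BackOff", "message": "restarting failed container"},
]
results = route_warnings(sample, lambda e: e["reason"])
print(results)  # ['BackOff']
```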
Building the Kubernetes side of this is straightforward. The real challenge is making a diagnostic layer trustworthy when the engine behind it is fundamentally non-deterministic.
Traditional observability is built on guarantees. Alerts fire on known thresholds. Dashboards show reproducible numbers. Logs return consistent answers to the same query. SRE success depends on that predictability — it's what makes incident response repeatable and on-call sustainable.
An LLM-driven diagnostic layer breaks that contract. The same pod failure can produce three different plausible explanations across three different runs. Each may be coherent. Each may even be correct under different assumptions. But "plausible" is not the same as "right," and for infrastructure, the gap between them is where outages live.
Some of the specific problems I've been working through while building K8gentS:
1. Confidence scoring with an unbounded output space. The agent returns top-3 root causes with confidence metrics, but confidence in what, exactly? The model is not selecting from a fixed set of known failure modes — it's generating free-form hypotheses. A calibrated confidence score needs a reference distribution, and the distribution here is whatever the model happened to produce this run.
2. When to trust reasoning vs. fall back to deterministic checks. Some failures (CrashLoopBackOff, OOMKilled) have well-traveled diagnostic paths, and a deterministic check will be right every time. Others benefit from the model's ability to interpolate across signals. Drawing that line, and doing it at runtime, is non-trivial — and it's where the art really lies.
3. Evaluating an agent that's supposed to find failures you didn't anticipate. The standard ML evaluation approach assumes you know what "correct" looks like. For a diagnostic agent, part of the value is catching novel failure modes — failures that, by definition, you couldn't pre-enumerate. So how do you decide what counts as right and what counts as wrong?
4. The "tool in the cluster" problem. How do you monitor the monitor? The agent currently runs as a service inside the cluster. But what if that service itself causes resource exhaustion, or is itself the thing failing? How do you tell when the monitor is the cause of the issue you're diagnosing?
5. Determining the right model for this problem. With so many other kinds of machine learning models available, is a non-deterministic large language model really the right choice, or is another class of model better suited to infrastructure problems?
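One way to draw the deterministic-vs-reasoning line from problem 2 is a deterministic-first router that only escalates to the LLM when no known playbook matches. This is a sketch under assumptions, not the project's implementation: the playbook table and function name are illustrative.

```python
# Deterministic-first routing: known failure reasons get a fixed diagnostic
# path; anything unrecognized escalates to the (non-deterministic) LLM.
# The playbook entries here are illustrative examples only.
DETERMINISTIC_PLAYBOOKS = {
    "CrashLoopBackOff": "inspect last container exit code and recent logs",
    "OOMKilled": "compare memory limits against actual usage",
}

def route(reason: str) -> tuple[str, str]:
    """Return (path, action) for a failure reason string."""
    if reason in DETERMINISTIC_PLAYBOOKS:
        return ("deterministic", DETERMINISTIC_PLAYBOOKS[reason])
    return ("llm", f"generate hypotheses for unrecognized failure: {reason}")

print(route("OOMKilled")[0])       # deterministic
print(route("SomethingNovel")[0])  # llm
```

The open question in the text is exactly the boundary of that table: which reasons belong in it, and whether membership can be decided at runtime rather than hardcoded.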
These are the questions I'm actively working on. If you've solved any of them — or have a sharper framing than I've got — I'd like to hear it.
The agent handles CrashLoopBackOff states, OOMKilled events, and other failure conditions such as Connectivity/DNS, Database Deadlock, or Secret/Config Missing. It is designed so that each layer independently limits blast radius — not as redundancy for its own sake, but because no single control is sufficient when the reasoning engine is non-deterministic.
| Layer | Mechanism | What it prevents |
|---|---|---|
| Pod security | runAsNonRoot, read-only filesystem, all Linux capabilities dropped | Container escape, privilege escalation |
| RBAC | Agent pod is strictly read-only; write verbs live only on k8gent-executor-sa | Agent compromise → cluster mutation |
| Ephemeral executor | Short-lived Jobs via k8gent-executor-sa; ttlSecondsAfterFinished=120 | Persistent foothold after remediation |
| OPA Gatekeeper | Rego policy enforced at the API server admission layer | Executor escaping its scope, even if RBAC is misconfigured |
| Log sanitization | Regex sweeper strips IPs, JWTs, API keys, emails before LLM call | Secrets exfiltration via LLM prompt |
| Rate limiting | Hourly circuit breakers and event debouncing | Noise-driven API budget exhaustion |
| Ingress-free comms | Slack Socket Mode; no exposed endpoints or Ingress rules | Inbound attack surface |
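The log-sanitization layer in the table above can be sketched with stdlib regexes. The patterns below are illustrative stand-ins, not the project's actual sweeper; production patterns for JWTs and API keys would need to be stricter.

```python
import re

# Illustrative redaction patterns; the project's real sweeper may differ.
PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),                          # IPv4
    (re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+\b"), "[JWT]"),                       # JWT shape
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),  # email
]

def sanitize(text: str) -> str:
    """Strip secrets/PII from log text before it reaches the LLM prompt."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(sanitize("dial 10.0.3.7 failed for admin@example.com"))
# -> dial [IP] failed for [EMAIL]
```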
The OPA Gatekeeper policy (defined in deploy/helm/k8gents/templates/opa-gatekeeper/) explicitly blocks the executor service account from modifying serviceAccountName, enabling hostNetwork or hostPID, operating inside kube-system, or mutating any resource kind other than pods and deployments — enforced directly at the Kubernetes API admission layer, independent of RBAC.
K8gentS ships as a Helm chart. OPA Gatekeeper is a declared chart dependency — the security sandbox installs automatically alongside the agent.
Before installing you'll need:
- kubectl authenticated to the target cluster
- An AI API key (AI_API_KEY)
- Slack tokens (SLACK_BOT_TOKEN starting xoxb- and SLACK_APP_TOKEN starting xapp-)
- A Slack channel ID (SLACK_CHANNEL_ID) and a comma-separated list of approver Slack user IDs (ALLOWED_APPROVERS)
To create the Slack app, go to api.slack.com: enabling Socket Mode generates an App Token (xapp-...), and adding the chat:write and chat:write.public OAuth scopes generates a Bot Token (xoxb-...).
Build and push the agent image:
docker build -t your-registry/k8gent:latest .
docker push your-registry/k8gent:latest
Update image.repository in deploy/helm/k8gents/values.yaml to match your registry path.
# Fetch chart dependencies (downloads OPA Gatekeeper)
helm dependency update deploy/helm/k8gents
# Install — secrets are injected at deploy time, never stored in source
helm install k8gents deploy/helm/k8gents \
--namespace k8gent-system \
--create-namespace \
--set secrets.aiApiKey="YOUR_GEMINI_KEY" \
--set secrets.slackBotToken="xoxb-..." \
--set secrets.slackAppToken="xapp-..." \
--set secrets.slackChannelId="C12345678" \
--set secrets.allowedApprovers="U123456,U789012"
To disable the OPA sandbox (if your cluster already runs Gatekeeper with its own policies):
--set sandbox.enabled=false --set gatekeeper.enabled=false
kubectl logs -l app=k8gents -n k8gent-system -f
You should see the watcher connect to the cluster API and the Slack Socket Mode connection initialize.
Key environment variables (set via --set agent.* in Helm, or directly if running locally):
| Variable | Default | Description |
|---|---|---|
| WATCH_NAMESPACES | all | Comma-separated namespaces to watch, or all for cluster-wide |
| AI_MODEL | gemini-2.5-pro | Any model name supported by the Google GenAI SDK |
| LOG_LEVEL | INFO | Python logging level |
| REMEDIATION_MODE | api | api (Kubernetes client, safe in-cluster) or subprocess (kubectl, local dev only) |
Changing AI_MODEL requires no code changes — the agent routes all LLM calls through the configured model name dynamically.
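That kind of env-driven configuration can be sketched as below. The helper name and dict shape are illustrative assumptions, not the project's actual code; the defaults mirror the table above.

```python
import os

# Resolve runtime config from environment variables, mirroring the table
# above. `resolve_config` is an illustrative helper, not the project's API.
def resolve_config(env=None):
    env = os.environ if env is None else env
    return {
        "model": env.get("AI_MODEL", "gemini-2.5-pro"),
        "namespaces": env.get("WATCH_NAMESPACES", "all").split(","),
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "remediation_mode": env.get("REMEDIATION_MODE", "api"),
    }

print(resolve_config({})["model"])  # gemini-2.5-pro (default applies)
print(resolve_config({"AI_MODEL": "gemini-2.0-flash",
                      "WATCH_NAMESPACES": "prod,staging"})["namespaces"])
# ['prod', 'staging']
```

Because the model name is read at startup rather than compiled in, swapping models is a Helm `--set agent.AI_MODEL=...` change followed by a pod restart.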
What's implemented and working:
What's genuinely unsolved: