"KubeWise: Let AI manage the chaos, so you can focus on innovation."
Managing modern Kubernetes clusters is a relentless battle against complexity, and SREs and DevOps teams are under constant pressure to keep workloads healthy.
Traditional monitoring tools often provide lagging indicators or flood teams with alerts that lack actionable insight, leaving you stuck in reactive, firefighting mode.
KubeWise is a revolutionary, AI-driven platform designed to transform your Kubernetes operations from reactive to proactive and intelligent.
It acts as an autonomous guardian for your clusters, seamlessly integrating metrics collection, ML-based anomaly detection, AI-powered analysis, and automated remediation.
KubeWise empowers you to achieve unprecedented levels of cluster stability, reduce operational overhead, and free up your valuable engineering resources to focus on innovation.
KubeWise isn't just another monitoring tool. It's a comprehensive, intelligent operations platform built on unique differentiators:
- 🧠 **True End-to-End Autonomy:** KubeWise features a sophisticated `AutonomousWorker` that orchestrates a continuous monitor -> detect -> analyze -> remediate -> verify loop. This true autonomy minimizes manual intervention, reduces human error, and ensures your cluster is being watched and managed 24/7.
- **Direct Failure Detection:** Beyond metric-based anomaly detection, KubeWise queries the Kubernetes API for resources in evident failure states (e.g., `CrashLoopBackOff` Pods, `NotReady` Nodes, `Failed` Deployments).
- **AI-Driven Remediation:** Suggested fixes are expressed as `kubectl` or structured K8s API commands tailored to the problem, and pass through safety checks that prefer non-destructive `kubectl` actions where possible.
- **Adaptive Learning & Continuous Improvement:** KubeWise is designed to evolve. Its ML models can be retrained automatically as more data becomes available, ensuring it adapts to the unique and changing dynamics of your cluster environment.
- **Flexible Operational Modes:** Switch between fully automated remediation (`AUTO`) and review-and-approve remediation (`MANUAL`) at any time via the API.
- **Observability Built In:** KubeWise exposes its own `/metrics` endpoint for Prometheus scraping.
- **Simple Configuration:** Everything is configured via environment variables (a `.env` file) for seamless integration into various environments.

See the power of KubeWise firsthand! This video demonstrates its key capabilities, from intelligent anomaly detection to AI-driven automated remediation.
KubeWise operates through a central `AutonomousWorker` that intelligently orchestrates the entire monitoring, detection, analysis, and remediation lifecycle.
```mermaid
flowchart TD
    subgraph "Kubernetes Cluster"
        Prometheus[📊 Prometheus Metrics]
        K8sAPI[☸️ Kubernetes API]
    end

    subgraph "KubeWise Application"
        A[PrometheusScraper] -->|Fetches Metrics| B(AutonomousWorker)
        K8sAPI -->|Fetches K8s Status| J(K8sExecutor)
        J -->|Direct Failure Info| B
        B -->|Stores Metric History| C(Metric History - in-memory)
        C -->|Provides Window| D(AnomalyDetector ML Model)
        B -->|Latest Metrics| D
        D -->|Anomaly/Failure Detected| E[AnomalyEventService]
        E -->|Stores Event| F[(💾 SQLite DB)]
        B -->|Triggers Analysis/Remediation| G(GeminiService AI 🧠)
        E -->|Provides Event Context| G
        J -->|Provides Cluster Context| G
        G -->|Analysis & Suggestions| E
        G -->|PromQL Queries| B
        B -->|"Executes Remediation (AUTO Mode)"| J
        J -->|Interacts With| K8sAPI
        B -->|Triggers Verification| G
        J -->|Fetches Current State| G
        E -->|Provides Attempt Info| G
        G -->|Verification Result| E

        User[👤 User / API Client] <-->|"API Calls (FastAPI)"| KubeWiseAPI{API Routers}
        KubeWiseAPI <--> E
        KubeWiseAPI <--> G
        KubeWiseAPI <--> J
        KubeWiseAPI <--> D
        KubeWiseAPI <--> B
    end

    Prometheus --> A

    %% Define straight lines for all edges
    linkStyle default stroke-width:2px,fill:none,stroke:gray;

    classDef user fill:#d14,stroke:#333,stroke-width:2px
    classDef api fill:#0366d6,stroke:#333,stroke-width:1px,color:#fff
    classDef ai fill:#6f42c1,stroke:#333,stroke-width:1px,color:#fff
    classDef storage fill:#2ea44f,stroke:#333,stroke-width:1px,color:#fff

    class User user
    class KubeWiseAPI api
    class G ai
    class F storage
```
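The `AnomalyDetector` node in the diagram scores each entity against its recent metric history. The exact model is not documented here, but since `scikit-learn` and `joblib` appear in the dependency list, a windowed detector of roughly this shape is a reasonable mental model. The sketch below is illustrative only; the class, method names, and `IsolationForest` choice are assumptions, not KubeWise's actual code:

```python
# Illustrative sketch of a window-based anomaly detector.
# Names and the IsolationForest choice are assumptions, not KubeWise's real API.
from collections import deque
from typing import List

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest


class WindowedAnomalyDetector:
    def __init__(self, window_size: int = 360, min_samples: int = 100):
        self.history = deque(maxlen=window_size)   # rolling window of metric vectors
        self.min_samples = min_samples             # cf. ANOMALY_DETECTION_MIN_SAMPLES
        self.model = IsolationForest(contamination=0.01, random_state=42)
        self.trained = False

    def observe(self, sample: List[float]) -> None:
        """Record the latest metric vector (e.g., CPU, memory, restart count)."""
        self.history.append(sample)

    def retrain(self) -> bool:
        """Refit on the current window; returns False if there is too little data."""
        if len(self.history) < self.min_samples:
            return False
        self.model.fit(np.array(self.history))
        self.trained = True
        joblib.dump(self.model, "anomaly_model.joblib")  # persist for later reuse
        return True

    def is_anomalous(self, sample: List[float]) -> bool:
        """IsolationForest labels outliers as -1."""
        if not self.trained:
            return False
        return self.model.predict(np.array([sample]))[0] == -1
```

In the running application, retraining is exposed via the `POST /api/v1/metrics/model/retrain` endpoint described in the API reference below.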
The cycle proceeds as follows:

1. **Collect:** The `PrometheusScraper` fetches the latest metrics based on active PromQL queries (configurable and potentially AI-generated), while the `K8sExecutor` queries the Kubernetes API directly for resources in evident failure states (e.g., `Failed` pods, `NotReady` nodes), complementing metric-based detection.
2. **Ingest:** The `AutonomousWorker` ingests the new data, updating the in-memory metric history for each monitored entity, which is crucial for time-series analysis.
3. **Detect:** The `AnomalyDetector` performs its comprehensive checks on the latest metrics and their history, alongside the direct failure states reported by the `K8sExecutor`. Detected anomalies and failures are recorded by the `AnomalyEventService` and persisted to the SQLite database.
4. **Analyze:** If enabled (`GEMINI_AUTO_ANALYSIS=True`), the `GeminiService` analyzes the event context (metrics, K8s status, historical data) to identify a likely root cause and suggest remediation steps.
5. **Decide:** The `AutonomousWorker` evaluates the situation: event severity, AI suggestions, and fallback logic. Any suggested command must pass the `K8sExecutor`'s safety checks.
6. **Remediate:** In `AUTO` mode, if the command is deemed safe and appropriate for the context (critical issue, predicted failure, standard anomaly), the `K8sExecutor` applies it via the Kubernetes API. In `MANUAL` mode, validated AI-generated suggestions (or fallback suggestions) are stored with the event; users can review them via the API and trigger remediation manually.
7. **Verify:** After remediation, the `AutonomousWorker` triggers a verification step. With `GEMINI_AUTO_VERIFICATION=True`, the `GeminiService` re-evaluates the entity's current state (metrics, K8s status) to confirm whether the issue is resolved; otherwise the `AnomalyDetector` can re-scan the entity. The event status is then updated to `VerifiedResolved` or `VerificationFailed`.

This cycle runs continuously and respects the configured `AUTO` and `MANUAL` operational modes.
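For orientation, here is a deliberately simplified sketch of what such a loop can look like. It is illustrative only: the component interfaces and method names below are assumptions, not KubeWise's actual implementation.

```python
# Illustrative sketch of the monitor -> detect -> analyze -> remediate -> verify loop.
# Not the real AutonomousWorker: all component method names below are assumptions.
import asyncio


async def autonomous_loop(scraper, executor, detector, events, gemini, settings):
    while True:
        # 1. Collect: metrics from Prometheus plus direct failure states from the K8s API
        metrics = await scraper.fetch_metrics()
        failures = await executor.find_failed_resources()

        for entity, sample in metrics.items():
            # 2. Ingest and detect against the in-memory history
            detector.observe(entity, sample)
            if detector.is_anomalous(entity, sample) or entity in failures:
                event = await events.record(entity, sample)

                # 3. Analyze with Gemini (if enabled) to get a suggested fix
                if settings.gemini_auto_analysis:
                    suggestion = await gemini.analyze(event)
                    await events.attach_suggestion(event, suggestion)

                    # 4. Remediate automatically only in AUTO mode, after safety checks
                    if settings.mode == "AUTO" and executor.is_safe(suggestion):
                        await executor.apply(suggestion)

                        # 5. Verify after a delay and record the outcome
                        await asyncio.sleep(settings.verification_delay_seconds)
                        verdict = await gemini.verify(event)
                        await events.update_status(event, verdict)

        await asyncio.sleep(settings.worker_sleep_interval_seconds)
```

The real worker additionally handles fallback suggestions, severity-based auto-remediation flags, and `MANUAL`-mode review, as described in the steps above.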
KubeWise is built for SREs, DevOps engineers, and platform teams. If you're looking to elevate your Kubernetes management from reactive to intelligent and autonomous, KubeWise is for you.
Ensure you have the following before setting up KubeWise:
| Requirement | Details |
|---|---|
| Python | Version 3.8 or higher |
| Kubernetes Cluster | Access to a functioning Kubernetes cluster (e.g., Minikube, Kind, GKE, EKS, AKS). Your `kubectl` should be configured to connect to it. |
| kubectl | Command-line tool configured to communicate with your cluster. |
| Prometheus | Deployed within your cluster (e.g., via a Helm chart such as `kube-prometheus-stack`) and accessible to KubeWise (typically via port-forwarding for local development). |
| Google Gemini API Key | **Essential for all AI-powered features.** Obtain one from Google AI Studio. |
| jq (optional, recommended) | A lightweight command-line JSON processor, useful for pretty-printing API responses. |
Note: KubeWise assumes it can reach Prometheus at the configured `PROMETHEUS_URL`. AI features (analysis, remediation suggestions, verification, query generation) will be significantly limited or disabled without a valid `GEMINI_API_KEY`.
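Before installing anything, you can sanity-check both of these prerequisites. The snippet below is a convenience check only (not part of KubeWise); it relies on Prometheus's standard `/-/healthy` endpoint and the environment variables described later in this README:

```python
# Quick prerequisite check (illustrative, not part of KubeWise):
# verifies that Prometheus is reachable and that a Gemini API key is set.
import os
import urllib.request

prometheus_url = os.environ.get("PROMETHEUS_URL", "http://localhost:9090")
try:
    with urllib.request.urlopen(f"{prometheus_url}/-/healthy", timeout=5) as resp:
        print(f"Prometheus reachable: HTTP {resp.status}")
except OSError as exc:
    print(f"Prometheus NOT reachable at {prometheus_url}: {exc}")

if os.environ.get("GEMINI_API_KEY"):
    print("GEMINI_API_KEY is set.")
else:
    print("GEMINI_API_KEY is missing - AI features will be limited or disabled.")
```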
Follow these steps to get KubeWise up and running:
```bash
git clone https://github.com/lohitkolluri/KubeWise.git
cd KubeWise
```
This isolates KubeWise's dependencies from your system Python packages.
```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
Install all required Python packages.
```bash
pip install -r requirements.txt
```
📦 Dependencies: Key packages include `fastapi`, `uvicorn`, `loguru`, `requests`, `kubernetes`, `scikit-learn`, `joblib`, `google-generativeai`, `sqlalchemy`, `aiosqlite`, `prometheus-client`, `pydantic-settings`, `httpx`, `aiohttp`, and `gunicorn`.
Copy the example `.env` file and customize it.
```bash
cp .env.example .env
```
Open the `.env` file in your editor and configure it. Crucially, set your `GEMINI_API_KEY`.

Key `.env` variables:
```dotenv
# .env Example
LOG_LEVEL=INFO # Logging verbosity (DEBUG, INFO, WARNING, ERROR)
# --- Prometheus Connection ---
PROMETHEUS_URL=http://localhost:9090 # URL for your Prometheus instance
# --- Autonomous Worker ---
WORKER_SLEEP_INTERVAL_SECONDS=15 # How often the main worker loop runs
# --- Gemini AI Configuration ---
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" # !!! REQUIRED for AI features !!!
GEMINI_MODEL_NAME="gemini-1.5-flash-latest" # Recommended model
GEMINI_AUTO_ANALYSIS=True # Enable AI for root cause analysis & remediation suggestions?
GEMINI_AUTO_VERIFICATION=True # Enable AI to verify remediation success?
GEMINI_AUTO_QUERY_GENERATION=False # Allow AI to suggest PromQL queries? (Experimental)
```
⚠️ Critical:
- Set your `GEMINI_API_KEY`. Without it, KubeWise's AI capabilities will be disabled, significantly limiting its functionality.
- Ensure `PROMETHEUS_URL` points to your accessible Prometheus instance.
For local development, port-forward your cluster's Prometheus service.
```bash
# Example: Find your Prometheus service (namespace and name might vary)
kubectl get svc -n monitoring

# Example: Port-forward (adjust service name and namespace if needed)
# This assumes the Prometheus service 'prometheus-kube-prometheus-prometheus' in the 'monitoring' namespace
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
```
Keep this terminal running. Verify access by opening `http://localhost:9090` in your browser; you should see the Prometheus UI.
KubeWise can be run in development mode (with Uvicorn for auto-reload) or production mode (with Gunicorn for robustness).
(Ensure your virtual environment is activated before running these commands)
Ideal for local development, offering features like auto-reload on code changes.
```bash
uvicorn kubewise.main:app --reload --host 0.0.0.0 --port 8000
```
- `--reload`: Automatically restarts the server when code changes are detected.
- `--host 0.0.0.0`: Makes the server accessible on your local network.
- `--port 8000`: Specifies the port to run on.

Recommended for actual deployments: Gunicorn acts as a process manager for Uvicorn workers, providing better performance, scalability, and reliability.
```bash
gunicorn kubewise.main:app -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000
```
- `-w 4`: Specifies the number of worker processes. A common recommendation is `(2 * number_of_cpu_cores) + 1` (for example, 9 workers on a 4-core machine). Adjust based on your server's resources.
- `-k uvicorn.workers.UvicornWorker`: Tells Gunicorn to use Uvicorn workers to handle requests asynchronously.
- `-b 0.0.0.0:8000`: Binds the server to the specified address and port.

Once KubeWise starts, it will initialize its components, connect to Prometheus and Kubernetes, start the `AutonomousWorker`, and begin its monitoring cycle. The API will be available at `http://localhost:8000` (or your configured host/port). Check the logs for startup messages and status.
Interact with KubeWise programmatically or via tools like `curl`. The API is served under `/api/v1/`, and the base URL for the examples below is `http://localhost:8000`. We recommend piping `curl` output to `jq` for readable JSON (append `| jq`).
- `GET /` — Root endpoint with basic service information.
  ```bash
  curl -X GET "http://localhost:8000/"
  ```
- `GET /api/v1/health/` — Application health check.
  ```bash
  curl -X GET "http://localhost:8000/api/v1/health/" | jq
  ```
- `GET /api/v1/setup/mode` — Get the current remediation mode (`AUTO` or `MANUAL`).
  ```bash
  curl -X GET "http://localhost:8000/api/v1/setup/mode" | jq
  ```
- `PUT /api/v1/setup/mode` — Set the remediation mode. Body: `{"mode": "AUTO"}` or `{"mode": "MANUAL"}`.
  ```bash
  curl -X PUT "http://localhost:8000/api/v1/setup/mode" -H "Content-Type: application/json" -d '{"mode": "AUTO"}' | jq
  ```
- `GET /api/v1/setup/config` — View the active configuration.
  ```bash
  curl -X GET "http://localhost:8000/api/v1/setup/config" | jq
  ```
- `GET /api/v1/metrics/` — Latest metrics collected by KubeWise.
  ```bash
  curl -X GET "http://localhost:8000/api/v1/metrics/" | jq
  ```
- `GET /api/v1/metrics/queries` — Active PromQL queries.
  ```bash
  curl -X GET "http://localhost:8000/api/v1/metrics/queries" | jq
  ```
- `GET /api/v1/metrics/model/info` — Information about the anomaly detection model.
  ```bash
  curl -X GET "http://localhost:8000/api/v1/metrics/model/info" | jq
  ```
- `POST /api/v1/metrics/model/retrain` — Retrain the anomaly detection model (potentially long-running).
  ```bash
  curl -X POST "http://localhost:8000/api/v1/metrics/model/retrain" | jq
  ```
- `GET /api/v1/remediation/events` — List anomaly events. Query parameters: `limit` (int), `offset` (int), `status` (e.g., `Detected`, `RemediationSuggested`, `VerificationFailed`), `severity` (e.g., `Critical`, `Warning`).
  ```bash
  curl -X GET "http://localhost:8000/api/v1/remediation/events?limit=10" | jq
  curl -X GET "http://localhost:8000/api/v1/remediation/events?status=RemediationSuggested&severity=Critical" | jq
  ```
- `GET /api/v1/remediation/events/{event_id}` — Get a single event by ID.
  ```bash
  curl -X GET "http://localhost:8000/api/v1/remediation/events/YOUR_EVENT_ID" | jq
  ```
- `POST /api/v1/remediation/events/{event_id}/analyze` — Trigger AI analysis for an event.
  ```bash
  curl -X POST "http://localhost:8000/api/v1/remediation/events/YOUR_EVENT_ID/analyze" | jq
  ```
- `POST /api/v1/remediation/events/{event_id}/remediate` — Trigger remediation for an event, primarily in `MANUAL` mode or if automated remediation failed/was skipped. Without a body, the event's stored `suggested_remediation` is used; alternatively, pass an explicit command:
  ```
  {
    "command_type": "kubectl", // or "k8s_api"
    "command": "kubectl delete pod my-pod -n my-namespace", // Full command string
    "parameters": null // or structured parameters for k8s_api
  }
  ```
  ```bash
  # Use the stored suggestion
  curl -X POST "http://localhost:8000/api/v1/remediation/events/YOUR_EVENT_ID/remediate" | jq

  # Provide an explicit command
  curl -X POST "http://localhost:8000/api/v1/remediation/events/YOUR_EVENT_ID/remediate" \
    -H "Content-Type: application/json" \
    -d '{"command_type": "kubectl", "command": "kubectl rollout restart deployment/my-app -n production"}' | jq
  ```
- `POST /api/v1/remediation/events/{event_id}/verify` — Trigger verification of a remediation attempt.
  ```bash
  curl -X POST "http://localhost:8000/api/v1/remediation/events/YOUR_EVENT_ID/verify" | jq
  ```
- `GET /api/v1/remediation/commands/allowed` — List the remediation commands KubeWise is allowed to execute.
  ```bash
  curl -X GET "http://localhost:8000/api/v1/remediation/commands/allowed" | jq
  ```
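If you prefer Python over `curl`, the same endpoints can be called with `httpx` (one of the listed dependencies). The example below is illustrative; in particular, the response field names (`id`, `status`, `severity`) are assumptions about the payload shape:

```python
# Illustrative API usage with httpx. Endpoints match the reference above;
# response field names ("id", "status", "severity") are assumptions.
import httpx

BASE_URL = "http://localhost:8000/api/v1"

with httpx.Client(base_url=BASE_URL, timeout=30.0) as client:
    # List the ten most recent anomaly events
    events = client.get("/remediation/events", params={"limit": 10}).json()

    for event in events:
        print(event.get("id"), event.get("status"), event.get("severity"))

    # Re-run AI analysis for the first event, if any exist
    if events:
        analysis = client.post(f"/remediation/events/{events[0]['id']}/analyze").json()
        print(analysis)
```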
KubeWise is primarily configured via environment variables, typically loaded from a `.env` file in the project root. You can also set these variables directly in your deployment environment.

The core configuration settings are defined in `kubewise/core/config.py` using Pydantic, which provides defaults and type validation.
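As a rough illustration of what that looks like with `pydantic-settings` (also in the dependency list), a settings class along these lines maps the environment variables below onto typed fields. This is a sketch covering only a subset of fields; the real `kubewise/core/config.py` is the authoritative source:

```python
# Illustrative pydantic-settings sketch; see kubewise/core/config.py for the real thing.
from typing import Optional

from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Values are read from the environment or a .env file, falling back to these defaults.
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    LOG_LEVEL: str = "INFO"
    DEFAULT_REMEDIATION_MODE: str = "MANUAL"
    PROMETHEUS_URL: str = "http://localhost:9090"
    WORKER_SLEEP_INTERVAL_SECONDS: int = 60
    METRIC_HISTORY_WINDOW_SECONDS: int = 3600
    ANOMALY_DETECTION_MIN_SAMPLES: int = 100
    GEMINI_API_KEY: Optional[str] = None
    GEMINI_MODEL_NAME: str = "gemini-1.5-flash-latest"
    GEMINI_AUTO_ANALYSIS: bool = True
    GEMINI_AUTO_VERIFICATION: bool = True
    DB_URL: str = "sqlite+aiosqlite:///./kubewise.db"
    API_V1_STR: str = "/api/v1"


settings = Settings()  # e.g. settings.PROMETHEUS_URL
```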
Key settings (defined in `kubewise/core/config.py`):

| Variable | Description | Default (`config.py`) | `.env` Example |
|---|---|---|---|
| `LOG_LEVEL` | Controls application log verbosity (DEBUG, INFO, WARNING, ERROR, CRITICAL). | `INFO` | `INFO` |
| `DEFAULT_REMEDIATION_MODE` | Initial global mode: `MANUAL` or `AUTO`. Can be changed via API. | `MANUAL` | `MANUAL` |
| `PROMETHEUS_URL` | Full URL for your Prometheus server. | `http://localhost:9090` | `http://prometheus.svc:9090` |
| `PROMETHEUS_QUERY_TIMEOUT_SECONDS` | Timeout for PromQL queries. | `10` | `15` |
| `WORKER_SLEEP_INTERVAL_SECONDS` | Main loop interval for the AutonomousWorker. | `60` | `30` |
| `METRIC_HISTORY_WINDOW_SECONDS` | Duration of metric history kept in memory for anomaly detection. | `3600` (1 hour) | `7200` (2 hours) |
| `ANOMALY_DETECTION_MIN_SAMPLES` | Minimum data points required before the ML model attempts prediction. | `100` | `50` |
| `GEMINI_API_KEY` | Your Google Gemini API key (REQUIRED FOR AI FEATURES). | `None` | `"YOUR_GEMINI_API_KEY"` |
| `GEMINI_MODEL_NAME` | Specific Gemini model to use. | `gemini-1.5-flash-latest` | `gemini-1.5-pro-latest` |
| `GEMINI_AUTO_ANALYSIS` | Enable AI for automatic root cause analysis & remediation suggestions. | `True` | `True` |
| `GEMINI_AUTO_VERIFICATION` | Enable AI to automatically verify remediation success. | `True` | `True` |
| `GEMINI_AUTO_QUERY_GENERATION` | Allow AI to suggest PromQL queries (experimental). | `False` | `False` |
| `GEMINI_SAFETY_SETTINGS_THRESHOLD` | Safety threshold for Gemini content generation (e.g., `BLOCK_MEDIUM_AND_ABOVE`). | `BLOCK_MEDIUM_AND_ABOVE` | `BLOCK_ONLY_HIGH` |
| `AUTO_REMEDIATE_CRITICAL_SEVERITY` | In `AUTO` mode, automatically remediate issues marked "Critical". | `True` | `True` |
| `AUTO_REMEDIATE_HIGH_SEVERITY` | In `AUTO` mode, automatically remediate issues marked "High". | `True` | `False` |
| `AUTO_REMEDIATE_PREDICTED_ISSUES` | In `AUTO` mode, attempt remediation for "Predicted" issues. | `False` | `True` |
| `REMEDIATION_VERIFICATION_DELAY_SECONDS` | Delay after remediation before starting verification. | `120` (2 minutes) | `180` (3 minutes) |
| `DB_URL` | SQLite database connection string. | `sqlite+aiosqlite:///./kubewise.db` | `sqlite+aiosqlite:///./data/main.db` |
| `API_V1_STR` | API prefix. | `/api/v1` | `/api/v1` |
For the most up-to-date and comprehensive list of configurable settings, always refer to `kubewise/core/config.py`. You can override any of these settings by creating an equivalent uppercase variable in your `.env` file.
KubeWise is distributed under a Proprietary License. Please see the `LICENSE` file in the repository for more detailed information.