
# KubeWise πŸ›‘οΈ

Your AI-Powered Guardian for Kubernetes: Autonomously Detect, Diagnose, Defend & Dominate Cluster Complexity

```
β–ˆβ–ˆβ•—  β–ˆβ–ˆβ•—β–ˆβ–ˆβ•—   β–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•—    β–ˆβ–ˆβ•—β–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—
β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•”β•β•β•β•β•β–ˆβ–ˆβ•‘    β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β•β•β•β•β•β–ˆβ–ˆβ•”β•β•β•β•β•
β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β• β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—  β–ˆβ–ˆβ•‘ β–ˆβ•— β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—
β–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•— β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•”β•β•β•  β–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘β•šβ•β•β•β•β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β•β•β•
β–ˆβ–ˆβ•‘  β–ˆβ–ˆβ•—β•šβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β•šβ–ˆβ–ˆβ–ˆβ•”β–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—
β•šβ•β•  β•šβ•β• β•šβ•β•β•β•β•β• β•šβ•β•β•β•β•β• β•šβ•β•β•β•β•β•β• β•šβ•β•β•β•šβ•β•β• β•šβ•β•β•šβ•β•β•β•β•β•β•β•šβ•β•β•β•β•β•β•
```

**Stop firefighting in Kubernetes. Start innovating. KubeWise offers intelligent, autonomous operations to ensure your clusters are stable, resilient, and performant.**

[![License: Proprietary](https://img.shields.io/badge/License-Proprietary-red.svg)](LICENSE)
[![Python Version](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://www.python.org/downloads/)
[![Status](https://img.shields.io/badge/status-active-success.svg)]()

β€œKubeWise: Let AI manage the chaos, so you can focus on innovation.”


## πŸ“‘ Table of Contents

- The Challenge: Kubernetes Complexity
- Introducing KubeWise: Your AI Co-Pilot for Kubernetes
- Why KubeWise? The KubeWise Advantage (USPs)
- Key Features In-Depth
- Watch KubeWise in Action: Demo
- Architecture & Workflow
- Who is KubeWise For?
- Getting Started: Prerequisites
- Setup & Installation
- Running the Application
- API Endpoints & Usage
- Configuration Deep Dive
- License


## 🀯 The Challenge: Kubernetes Complexity

Managing modern Kubernetes clusters is a relentless battle against complexity, and SREs and DevOps teams are constantly challenged to keep up.

Traditional monitoring tools often provide lagging indicators or flood teams with alerts lacking actionable insights, leaving you in a reactive, firefighting mode.


## ✨ Introducing KubeWise: Your AI Co-Pilot for Kubernetes

KubeWise is a revolutionary, AI-driven platform designed to transform your Kubernetes operations from reactive to proactive and intelligent.

It acts as an autonomous guardian for your clusters, seamlessly integrating Prometheus-based metrics monitoring, machine-learning anomaly detection, Gemini-powered analysis, and automated Kubernetes remediation.

KubeWise empowers you to achieve unprecedented levels of cluster stability, reduce operational overhead, and free up your valuable engineering resources to focus on innovation.


πŸ† Why KubeWise? The KubeWise Advantage (USPs)

KubeWise isn’t just another monitoring tool. It’s a comprehensive, intelligent operations platform built on unique differentiators: autonomous monitoring, ML-based anomaly detection, AI-driven root cause analysis, and safety-checked automated remediation with verification.


## 🌟 Key Features In-Depth


## 🎬 Watch KubeWise in Action: Demo

See the power of KubeWise firsthand! This video demonstrates its key capabilities, from intelligent anomaly detection to AI-driven automated remediation.

[![KubeWise Demo Video](https://img.youtube.com/vi/PxobbNKy1Kc/0.jpg)](https://youtu.be/PxobbNKy1Kc) **(Click the image to watch the demo on YouTube)**

πŸ—οΈ Architecture & Workflow

KubeWise operates through a central AutonomousWorker that intelligently orchestrates the entire monitoring, detection, analysis, and remediation lifecycle.

### πŸ“Š Architecture Diagram

```mermaid
flowchart TD
    subgraph "Kubernetes Cluster"
        Prometheus[πŸ“ˆ Prometheus Metrics]
        K8sAPI[☸️ Kubernetes API]
    end

    subgraph "KubeWise Application"
        A[PrometheusScraper] -->|Fetches Metrics| B(AutonomousWorker)
        K8sAPI -->|Fetches K8s Status| J(K8sExecutor)
        J -->|Direct Failure Info| B

        B -->|Stores Metric History| C(Metric History - in-memory)
        C -->|Provides Window| D(AnomalyDetector ML Model)
        B -->|Latest Metrics| D

        D -->|Anomaly/Failure Detected| E[AnomalyEventService]
        E -->|Stores Event| F[(πŸ’Ύ SQLite DB)]

        B -->|Triggers Analysis/Remediation| G(GeminiService AI 🧠)
        E -->|Provides Event Context| G
        J -->|Provides Cluster Context| G

        G -->|Analysis & Suggestions| E
        G -->|PromQL Queries| B

        B -->|"Executes Remediation (AUTO Mode)"| J
        J -->|Interacts With| K8sAPI

        B -->|Triggers Verification| G
        J -->|Fetches Current State| G
        E -->|Provides Attempt Info| G
        G -->|Verification Result| E

        User[πŸ‘€ User / API Client] <-->|"API Calls (FastAPI)"| KubeWiseAPI{API Routers}
        KubeWiseAPI <--> E
        KubeWiseAPI <--> G
        KubeWiseAPI <--> J
        KubeWiseAPI <--> D
        KubeWiseAPI <--> B
    end

    Prometheus --> A

    %% Define straight lines for all edges
    linkStyle default stroke-width:2px,fill:none,stroke:gray;

    classDef user fill:#d14,stroke:#333,stroke-width:2px
    classDef api fill:#0366d6,stroke:#333,stroke-width:1px,color:#fff
    classDef ai fill:#6f42c1,stroke:#333,stroke-width:1px,color:#fff
    classDef storage fill:#2ea44f,stroke:#333,stroke-width:1px,color:#fff

    class User user
    class KubeWiseAPI api
    class G ai
    class F storage
```

### πŸ”„ Operational Workflow: 10-Step Process

1. Scrape Metrics: PrometheusScraper fetches the latest metrics based on active PromQL queries (configurable and potentially AI-generated).
2. Direct K8s Scan: K8sExecutor queries the Kubernetes API directly for resources in evident failure states (e.g., Failed pods, NotReady nodes), complementing metric-based detection.
3. Process & Store Metrics: AutonomousWorker ingests new data, updating the in-memory metric history for each monitored entity, crucial for time-series analysis.
4. Detect Anomalies & Failures: AnomalyDetector performs its comprehensive checks:
   - Runs the Isolation Forest model on metric history.
   - Evaluates direct K8s status from K8sExecutor.
   - Checks for threshold breaches on latest metrics.
   - Applies predictive forecasting algorithms.
5. Record Event: If an anomaly, failure, or predicted issue is detected, an event is created via AnomalyEventService and persisted to the SQLite database.
6. AI-Powered Analysis (Gemini): If enabled (GEMINI_AUTO_ANALYSIS=True), GeminiService analyzes the event context (metrics, K8s status, historical data) to:
   - Determine the likely root cause.
   - Generate targeted remediation command suggestions.
   - These insights are stored with the event.
7. Decide & Remediate (AUTO Mode):
   - AutonomousWorker evaluates the situation: event severity, AI suggestions, fallback logic.
   - It validates proposed commands using K8sExecutor’s safety checks.
   - If in AUTO mode and the command is deemed safe and appropriate for the context (critical issue, predicted failure, standard anomaly), K8sExecutor applies it via the Kubernetes API.
   - All remediation attempts are recorded in the event history.
8. Suggest & Await (MANUAL Mode): If in MANUAL mode, validated AI-generated suggestions (or fallback suggestions) are stored with the event. Users can review these via the API and decide to trigger remediation manually.
9. Verify Remediation (AUTO Mode & AI-Enabled): After a configurable delay post-remediation, AutonomousWorker triggers a verification step.
   - If GEMINI_AUTO_VERIFICATION=True, GeminiService re-evaluates the entity’s current state (metrics, K8s status) to confirm the issue is resolved.
   - Alternatively, AnomalyDetector can re-scan the entity.
   - The event status is updated to VerifiedResolved or VerificationFailed.
10. User Interaction via API: Throughout this cycle, users can interact with KubeWise via its FastAPI interface (see the example after this list) to:
    - View system health and configuration.
    - List and inspect anomaly events.
    - Manually trigger analysis or remediation for specific events.
    - Switch between AUTO and MANUAL operational modes.
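A minimal sketch of what that shell-level interaction could look like is shown below. The endpoint paths here are hypothetical placeholders for illustration; only the `/api/v1` prefix comes from the configuration, so substitute the real routes from the API sections and interactive docs.

```bash
# Hypothetical examples only: replace the paths with the real routes from the API docs.
BASE=http://localhost:8000/api/v1

# List recent anomaly events (hypothetical path), pretty-printed with jq.
curl -s "$BASE/events" | jq

# Switch the global remediation mode from MANUAL to AUTO (hypothetical path).
curl -s -X PUT "$BASE/setup/mode" \
  -H "Content-Type: application/json" \
  -d '{"mode": "AUTO"}' | jq

# Manually trigger remediation for a specific event (hypothetical path and ID).
curl -s -X POST "$BASE/events/123/remediate" | jq
```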

## 🎯 Who is KubeWise For?

KubeWise is built for SRE, DevOps, and platform teams responsible for keeping Kubernetes clusters stable, who want to move from reactive firefighting to proactive, autonomous operations.

If you’re looking to elevate your Kubernetes management from reactive to intelligent and autonomous, KubeWise is for you.


βš™οΈ Getting Started: Prerequisites

Ensure you have the following before setting up KubeWise:

- **Python**: Version 3.8 or higher.
- **Kubernetes Cluster**: Access to a functioning Kubernetes cluster (e.g., Minikube, Kind, GKE, EKS, AKS). Your `kubectl` should be configured to connect to it.
- **kubectl**: Command-line tool configured to communicate with your cluster.
- **Prometheus**: Deployed within your cluster (e.g., via a Helm chart like kube-prometheus-stack) and accessible to KubeWise (typically via port-forwarding for local development).
- **Google Gemini API Key**: **Essential for all AI-powered features.** Obtain one from Google AI Studio.
- **jq** (optional, recommended): A lightweight command-line JSON processor. Useful for pretty-printing API responses (install commands below).
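
If you don't already have jq, it is available from most package managers, for example:

```bash
# Debian/Ubuntu
sudo apt-get install -y jq

# macOS (Homebrew)
brew install jq
```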

Note: KubeWise assumes it can reach Prometheus at the configured PROMETHEUS_URL. AI features (analysis, remediation suggestions, verification, query generation) will be significantly limited or disabled without a valid GEMINI_API_KEY.
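
If you want to sanity-check the key before starting KubeWise, one option is to list the models it can access via Google's Generative Language REST API (the endpoint below is an assumption about the current public API surface, not something KubeWise itself calls this way):

```bash
# Quick sanity check of the Gemini API key (assumes the public Generative Language API endpoint).
curl -s "https://generativelanguage.googleapis.com/v1beta/models?key=${GEMINI_API_KEY}" | jq '.models[].name'
```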


## πŸš€ Setup & Installation

Follow these steps to get KubeWise up and running:

### 1. Clone the Repository

```bash
git clone https://github.com/lohitkolluri/KubeWise.git
cd KubeWise
```

### 2. Create and Activate a Python Virtual Environment

This isolates KubeWise dependencies.

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

### 3. Install Dependencies

Install all required Python packages.

```bash
pip install -r requirements.txt
```

πŸ“¦ Dependencies: Key packages include fastapi, uvicorn, loguru, requests, kubernetes, scikit-learn, joblib, google-generativeai, sqlalchemy, aiosqlite, prometheus-client, pydantic-settings, httpx, aiohttp, and gunicorn.

### 4. Configure Environment Variables

Copy the example .env file and customize it.

```bash
cp .env.example .env
```

Open the .env file in your editor and configure it. Crucially, set your GEMINI_API_KEY.

**Key `.env` Variables:**

```dotenv
# .env Example
LOG_LEVEL=INFO                              # Logging verbosity (DEBUG, INFO, WARNING, ERROR)

# --- Prometheus Connection ---
PROMETHEUS_URL=http://localhost:9090        # URL for your Prometheus instance

# --- Autonomous Worker ---
WORKER_SLEEP_INTERVAL_SECONDS=15            # How often the main worker loop runs

# --- Gemini AI Configuration ---
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"   # !!! REQUIRED for AI features !!!
GEMINI_MODEL_NAME="gemini-1.5-flash-latest" # Recommended model
GEMINI_AUTO_ANALYSIS=True                   # Enable AI for root cause analysis & remediation suggestions?
GEMINI_AUTO_VERIFICATION=True               # Enable AI to verify remediation success?
GEMINI_AUTO_QUERY_GENERATION=False          # Allow AI to suggest PromQL queries? (Experimental)
```

⚠️ Critical: Set a valid GEMINI_API_KEY. Without it, the AI-powered features (analysis, remediation suggestions, verification, query generation) will be significantly limited or disabled.

### 5. Ensure Prometheus is Accessible

For local development, port-forward your cluster’s Prometheus service.

```bash
# Example: Find your Prometheus service (namespace and name might vary)
kubectl get svc -n monitoring

# Example: Port-forward (adjust service name and namespace if needed)
# This command assumes Prometheus service 'prometheus-kube-prometheus-prometheus' in 'monitoring' namespace
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
```

Keep this terminal running. Verify access by opening http://localhost:9090 in your browser. You should see the Prometheus UI.
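
If you prefer a command-line check, Prometheus also exposes standard readiness and query endpoints you can probe directly:

```bash
# Confirm the port-forwarded Prometheus instance is ready to serve traffic.
curl -s http://localhost:9090/-/ready

# Run a trivial instant query to confirm the HTTP API responds.
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.status'
```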


## ▢️ Running the Application

KubeWise can be run in development mode (with Uvicorn for auto-reload) or production mode (with Gunicorn for robustness).

| Mode | Recommended Use Case | Command |
|------|----------------------|---------|
| **Development** | Local testing, debugging | `uvicorn kubewise.main:app --reload --host 0.0.0.0 --port 8000` |
| **Production** | Deployment, stable operation | `gunicorn kubewise.main:app -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000` |

(Ensure your virtual environment is activated before running these commands)

### Development Mode (Uvicorn)

Ideal for local development, offering features like auto-reload on code changes.

```bash
uvicorn kubewise.main:app --reload --host 0.0.0.0 --port 8000
```

### Production Mode (Gunicorn + Uvicorn Workers)

Recommended for actual deployments. Gunicorn acts as a process manager for Uvicorn workers, providing better performance, scalability, and reliability.

```bash
gunicorn kubewise.main:app -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000
```

πŸš€ Once KubeWise starts, it will initialize its components, connect to Prometheus and Kubernetes, start the AutonomousWorker, and begin its monitoring cycle. The API will be available at http://localhost:8000 (or your configured host/port). Check the logs for startup messages and status.


## πŸ”Œ API Endpoints & Usage

Interact with KubeWise programmatically or via tools like curl. The API is served from /api/v1/.

(Base URL for examples: `http://localhost:8000`.) We recommend piping `curl` output through `jq` for readable JSON, e.g. by appending `| jq`.
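
Assuming the default FastAPI documentation routes have not been disabled, you can browse every registered route interactively at `http://localhost:8000/docs`, or pull the OpenAPI schema directly:

```bash
# List all registered API paths from the OpenAPI schema (FastAPI's default /openapi.json route).
curl -s http://localhost:8000/openapi.json | jq '.paths | keys'
```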

### Root & Health Endpoints

### Setup & Configuration Endpoints

### Metrics & Anomaly Model Endpoints

### Anomaly Events & Remediation Endpoints


## πŸ”§ Configuration Deep Dive

KubeWise is primarily configured via environment variables, typically loaded from a .env file in the project root. You can also set these variables directly in your deployment environment. The core configuration settings are defined in kubewise/core/config.py using Pydantic, which provides defaults and type validation.

### Key Configuration Variables (defaults in `kubewise/core/config.py`)

| Variable | Description | Default (`config.py`) | `.env` Example |
|----------|-------------|------------------------|----------------|
| `LOG_LEVEL` | Controls application log verbosity (DEBUG, INFO, WARNING, ERROR, CRITICAL). | `INFO` | `INFO` |
| `DEFAULT_REMEDIATION_MODE` | Initial global mode: `MANUAL` or `AUTO`. Can be changed via API. | `MANUAL` | `MANUAL` |
| `PROMETHEUS_URL` | Full URL for your Prometheus server. | `http://localhost:9090` | `http://prometheus.svc:9090` |
| `PROMETHEUS_QUERY_TIMEOUT_SECONDS` | Timeout for PromQL queries. | `10` | `15` |
| `WORKER_SLEEP_INTERVAL_SECONDS` | Main loop interval for the AutonomousWorker. | `60` | `30` |
| `METRIC_HISTORY_WINDOW_SECONDS` | Duration of metric history kept in memory for anomaly detection. | `3600` (1 hour) | `7200` (2 hours) |
| `ANOMALY_DETECTION_MIN_SAMPLES` | Minimum data points required before the ML model attempts prediction. | `100` | `50` |
| `GEMINI_API_KEY` | Your Google Gemini API key (**required for AI features**). | `None` | `"YOUR_GEMINI_API_KEY"` |
| `GEMINI_MODEL_NAME` | Specific Gemini model to use. | `gemini-1.5-flash-latest` | `gemini-1.5-pro-latest` |
| `GEMINI_AUTO_ANALYSIS` | Enable AI for automatic root cause analysis & remediation suggestions. | `True` | `True` |
| `GEMINI_AUTO_VERIFICATION` | Enable AI to automatically verify remediation success. | `True` | `True` |
| `GEMINI_AUTO_QUERY_GENERATION` | Allow AI to suggest PromQL queries (experimental). | `False` | `False` |
| `GEMINI_SAFETY_SETTINGS_THRESHOLD` | Safety threshold for Gemini content generation (e.g., BLOCK_MEDIUM_AND_ABOVE). | `BLOCK_MEDIUM_AND_ABOVE` | `BLOCK_ONLY_HIGH` |
| `AUTO_REMEDIATE_CRITICAL_SEVERITY` | In AUTO mode, automatically remediate issues marked β€˜Critical’. | `True` | `True` |
| `AUTO_REMEDIATE_HIGH_SEVERITY` | In AUTO mode, automatically remediate issues marked β€˜High’. | `True` | `False` |
| `AUTO_REMEDIATE_PREDICTED_ISSUES` | In AUTO mode, attempt remediation for β€˜Predicted’ issues. | `False` | `True` |
| `REMEDIATION_VERIFICATION_DELAY_SECONDS` | Delay after remediation before starting verification. | `120` (2 minutes) | `180` (3 minutes) |
| `DB_URL` | SQLite database connection string. | `sqlite+aiosqlite:///./kubewise.db` | `sqlite+aiosqlite:///./data/main.db` |
| `API_V1_STR` | API prefix. | `/api/v1` | `/api/v1` |

πŸ“ For the most up-to-date and comprehensive list of configurable settings, always refer to kubewise/core/config.py. You can override any of these settings by creating an equivalent uppercase variable in your .env file.


## πŸ“„ License

KubeWise is distributed under a Proprietary License. Please see the LICENSE file in the repository for more detailed information.


Thank you for exploring KubeWise! We believe it can significantly enhance your Kubernetes operations.