llm-incident-manager

Crates.iollm-incident-manager
lib.rsllm-incident-manager
version1.0.1
created_at2025-11-14 00:01:11.878753+00
updated_at2025-11-14 02:13:01.397688+00
descriptionEnterprise-grade incident management system for LLM operations
homepagehttps://github.com/globalbusinessadvisors/llm-incident-manager
repositoryhttps://github.com/globalbusinessadvisors/llm-incident-manager
max_upload_size
id1932039
size1,923,433
GBA (globalbusinessadvisors)

documentation

https://docs.rs/llm-incident-manager

README

LLM Incident Manager

Overview

LLM Incident Manager is an enterprise-grade, production-ready incident management system built in Rust, designed specifically for LLM DevOps ecosystems. It provides intelligent incident detection, classification, enrichment, correlation, routing, escalation, and automated resolution capabilities for modern LLM infrastructure.

Key Features

Core Capabilities

  • 🚀 High Performance: Built in Rust with async/await for maximum throughput and minimal latency
  • ðŸĪ– ML-Powered Classification: Machine learning-based incident classification with confidence scoring
  • 🔍 Context Enrichment: Automatic enrichment with historical data, service info, and team context
  • 🔗 Intelligent Correlation: Groups related incidents to reduce alert fatigue
  • ⚡ Smart Escalation: Policy-based escalation with multi-level notification chains
  • 📊 Persistent Storage: PostgreSQL and in-memory storage implementations
  • ðŸŽŊ Smart Routing: Policy-based routing with team and severity-based rules
  • 🔔 Multi-Channel Notifications: Email, Slack, PagerDuty, webhooks
  • ðŸĪ Automated Playbooks: Execute automated remediation workflows
  • 📝 Complete Audit Trail: Full incident lifecycle tracking

Implemented Subsystems

1. Escalation Engine ✅

  • Multi-level escalation policies
  • Time-based automatic escalation
  • Configurable notification channels per level
  • Target types: Users, Teams, On-Call schedules
  • Pause/resume/resolve escalation flows
  • Real-time escalation state tracking
  • Documentation: ESCALATION_GUIDE.md

2. Persistent Storage ✅

  • PostgreSQL backend with connection pooling
  • In-memory storage for testing/development
  • Trait-based abstraction for extensibility
  • Transaction support for data consistency
  • Full incident lifecycle persistence
  • Query optimizations and indexing
  • Documentation: STORAGE_IMPLEMENTATION.md

3. Correlation Engine ✅

  • Time-window based correlation
  • Multi-strategy correlation: Source, Type, Similarity, Tag, Service
  • Dynamic correlation groups
  • Configurable thresholds and windows
  • Pattern detection across incidents
  • Graph-based relationship tracking
  • Documentation: CORRELATION_GUIDE.md

4. ML Classification ✅

  • Automated severity classification
  • Multi-model ensemble architecture
  • Feature extraction from incidents
  • Confidence scoring
  • Incremental learning with feedback
  • Model versioning and persistence
  • Real-time classification API
  • Documentation: ML_CLASSIFICATION_GUIDE.md

5. Context Enrichment ✅

  • Historical incident analysis with similarity matching
  • Service catalog integration (CMDB)
  • Team and on-call information
  • External API integrations (Prometheus, Elasticsearch)
  • Parallel enrichment pipeline
  • Intelligent caching with TTL
  • Configurable enrichers and priorities
  • Documentation: ENRICHMENT_GUIDE.md

6. Deduplication Engine ✅

  • Fingerprint-based duplicate detection
  • Time-window deduplication
  • Automatic incident merging
  • Alert correlation

7. Notification Service ✅

  • Multi-channel delivery (Email, Slack, PagerDuty)
  • Template-based formatting
  • Rate limiting and throttling
  • Delivery confirmation

8. Playbook Automation ✅

  • Trigger-based playbook execution
  • Step-by-step action execution
  • Auto-execution on incident creation
  • Manual playbook execution

9. Routing Engine ✅

  • Rule-based incident routing
  • Team assignment suggestions
  • Severity-based routing
  • Service-aware routing

10. LLM Integrations ✅

  • Sentinel Client: Monitoring & anomaly detection with ML-powered analysis
  • Shield Client: Security threat analysis and mitigation planning
  • Edge-Agent Client: Distributed edge inference with offline queue management
  • Governance Client: Multi-framework compliance (GDPR, HIPAA, SOC2, PCI, ISO27001)
  • Enterprise features: Exponential backoff retry, circuit breaker, rate limiting
  • Comprehensive error handling and observability

11. GraphQL API with WebSocket Streaming ✅

  • Full-featured GraphQL API alongside REST
  • Real-time WebSocket subscriptions for incident updates
  • Type-safe schema with queries, mutations, and subscriptions
  • DataLoaders for efficient batch loading and N+1 prevention
  • GraphQL Playground for interactive API exploration
  • Support for filtering, pagination, and complex queries
  • Documentation: GRAPHQL_GUIDE.md, WEBSOCKET_STREAMING_GUIDE.md

12. Metrics & Observability ✅

  • Prometheus Integration: Native Prometheus metrics export on port 9090
  • Real-time Performance Tracking: Request rates, latency, success/error rates
  • Integration Metrics: Per-integration monitoring (Sentinel, Shield, Edge-Agent, Governance)
  • System Metrics: Processing pipeline, correlation, enrichment, ML classification
  • Zero-Overhead Collection: Lock-free atomic operations with <1Ξs recording time
  • Grafana Dashboards: Pre-built dashboards for system overview and deep-dive analysis
  • Alert Rules: Production-ready alerting for critical conditions
  • Documentation: METRICS_GUIDE.md | Implementation | Runbook

13. Circuit Breaker Pattern ✅

  • Resilience Pattern: Prevent cascading failures with automatic circuit breaking
  • State Management: Closed, Open, and Half-Open states with intelligent transitions
  • Per-Service Configuration: Individual circuit breakers for each external dependency
  • Fast Failure: Millisecond response time when circuit is open (vs. 30s+ timeouts)
  • Automatic Recovery: Self-healing with configurable recovery strategies
  • Fallback Support: Graceful degradation with fallback mechanisms
  • Comprehensive Metrics: Real-time state tracking and Prometheus integration
  • Manual Control: API endpoints for operational override and testing
  • Documentation: CIRCUIT_BREAKER_GUIDE.md | API Reference | Integration Guide | Operations

Architecture

System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        LLM Incident Manager                          │
├──────────────────────────────────────────────────────────────────────â”Ī
│                                                                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐  │
│  │   REST API   │  │   gRPC API   │  │    GraphQL API           │  │
│  │  (HTTP/JSON) │  │ (Protobuf)   │  │ (Queries/Mutations/Subs) │  │
│  └──────┮───────┘  └──────┮───────┘  └──────┮───────────────────┘  │
│         │                  │                  │                       │
│         └──────────────────┾──────────────────┘                      │
│                            ▾                                          │
│                 ┌─────────────────────┐                              │
│                 │ IncidentProcessor   │                              │
│                 │  - Deduplication    │                              │
│                 │  - Classification   │                              │
│                 │  - Enrichment       │                              │
│                 │  - Correlation      │                              │
│                 └─────────┮───────────┘                              │
│                           │                                           │
│         ┌─────────────────┾─────────────────┐                       │
│         ▾                 ▾                 ▾                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                │
│  │  Escalation │  │ Notification │  │  Playbook   │                │
│  │   Engine    │  │   Service    │  │   Service   │                │
│  └─────────────┘  └─────────────┘  └─────────────┘                │
│         │                 │                 │                        │
│         └─────────────────┾─────────────────┘                       │
│                           ▾                                           │
│                 ┌─────────────────────┐                              │
│                 │   Storage Layer     │                              │
│                 │  - PostgreSQL       │                              │
│                 │  - In-Memory        │                              │
│                 └─────────────────────┘                              │
└──────────────────────────────────────────────────────────────────────┘

Data Flow

Alert → Deduplication → ML Classification → Context Enrichment
                                                     ↓
                                              Correlation
                                                     ↓
                        Routing ← ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
                           ↓
        ┌──────────────────┾──────────────────┐
        ▾                  ▾                   ▾
  Notifications      Escalation           Playbooks

Quick Start

Prerequisites

  • Rust 1.70+ (2021 edition)
  • PostgreSQL 14+ (optional, for persistent storage)
  • Redis (optional, for distributed caching)

Installation

# Clone repository
git clone https://github.com/globalbusinessadvisors/llm-incident-manager.git
cd llm-incident-manager

# Build
cargo build --release

# Run tests
cargo test --all-features

# Run with default configuration (in-memory storage)
cargo run --release

Basic Usage

use llm_incident_manager::{
    Config,
    models::{Alert, Incident, Severity, IncidentType},
    processing::{IncidentProcessor, DeduplicationEngine},
    state::InMemoryStore,
    escalation::EscalationEngine,
    enrichment::EnrichmentService,
    correlation::CorrelationEngine,
    ml::MLService,
};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize storage
    let store = Arc::new(InMemoryStore::new());

    // Create deduplication engine
    let dedup_engine = Arc::new(DeduplicationEngine::new(store.clone(), 900));

    // Create incident processor
    let mut processor = IncidentProcessor::new(store.clone(), dedup_engine);

    // Optional: Add escalation engine
    let escalation_engine = Arc::new(EscalationEngine::new());
    processor.set_escalation_engine(escalation_engine);

    // Optional: Add ML classification
    let ml_service = Arc::new(MLService::new(Default::default()));
    ml_service.start().await?;
    processor.set_ml_service(ml_service);

    // Optional: Add context enrichment
    let enrichment_config = Default::default();
    let enrichment_service = Arc::new(
        EnrichmentService::new(enrichment_config, store.clone())
    );
    enrichment_service.start().await?;
    processor.set_enrichment_service(enrichment_service);

    // Optional: Add correlation engine
    let correlation_engine = Arc::new(
        CorrelationEngine::new(store.clone(), Default::default())
    );
    processor.set_correlation_engine(correlation_engine);

    // Process an alert
    let alert = Alert::new(
        "ext-123".to_string(),
        "monitoring".to_string(),
        "High CPU Usage".to_string(),
        "CPU usage exceeded 90% threshold".to_string(),
        Severity::P1,
        IncidentType::Infrastructure,
    );

    let ack = processor.process_alert(alert).await?;
    println!("Incident created: {:?}", ack.incident_id);

    Ok(())
}

Configuration

Environment Variables

# Database
DATABASE_URL=postgresql://user:password@localhost/incident_manager
DATABASE_POOL_SIZE=20

# Redis (optional)
REDIS_URL=redis://localhost:6379

# API Server
API_HOST=0.0.0.0
API_PORT=3000

# gRPC Server
GRPC_HOST=0.0.0.0
GRPC_PORT=50051

# Feature Flags
ENABLE_ML_CLASSIFICATION=true
ENABLE_ENRICHMENT=true
ENABLE_CORRELATION=true
ENABLE_ESCALATION=true

# Logging
RUST_LOG=info,llm_incident_manager=debug

Configuration File (config.yaml)

instance_id: "standalone-001"

# Storage configuration
storage:
  type: "postgresql"  # or "memory"
  connection_string: "postgresql://localhost/incident_manager"
  pool_size: 20

# ML Configuration
ml:
  enabled: true
  confidence_threshold: 0.7
  model_path: "./models"
  auto_train: true
  training_batch_size: 100

# Enrichment Configuration
enrichment:
  enabled: true
  enable_historical: true
  enable_service: true
  enable_team: true
  timeout_secs: 10
  cache_ttl_secs: 300
  async_enrichment: true
  max_concurrent: 5
  similarity_threshold: 0.5

# Correlation Configuration
correlation:
  enabled: true
  time_window_secs: 300
  min_incidents: 2
  max_group_size: 50
  enable_source: true
  enable_type: true
  enable_similarity: true
  enable_tags: true
  enable_service: true

# Escalation Configuration
escalation:
  enabled: true
  default_timeout_secs: 300

# Deduplication Configuration
deduplication:
  window_secs: 900
  fingerprint_enabled: true

# Notification Configuration
notifications:
  channels:
    - type: "email"
      enabled: true
    - type: "slack"
      enabled: true
      webhook_url: "https://hooks.slack.com/..."
    - type: "pagerduty"
      enabled: true
      integration_key: "..."

API Examples

WebSocket Streaming (Real-Time Updates)

The LLM Incident Manager provides a GraphQL WebSocket API for real-time incident streaming. This allows clients to subscribe to incident events and receive immediate notifications.

Quick Start:

import { createClient } from 'graphql-ws';

const client = createClient({
  url: 'ws://localhost:8080/graphql/ws',
  connectionParams: {
    Authorization: 'Bearer YOUR_JWT_TOKEN'
  }
});

// Subscribe to critical incidents
client.subscribe(
  {
    query: `
      subscription {
        criticalIncidents {
          id
          title
          severity
          state
          createdAt
        }
      }
    `
  },
  {
    next: (data) => {
      console.log('Critical incident:', data.criticalIncidents);
    },
    error: (error) => console.error('Subscription error:', error),
    complete: () => console.log('Subscription completed')
  }
);

Available Subscriptions:

  • criticalIncidents - Subscribe to P0 and P1 incidents
  • incidentUpdates - Subscribe to incident lifecycle events
  • newIncidents - Subscribe to newly created incidents
  • incidentStateChanges - Subscribe to state transitions
  • alerts - Subscribe to incoming alert submissions

Documentation:

REST API

# Create an incident
curl -X POST http://localhost:3000/api/v1/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "source": "monitoring",
    "title": "High Memory Usage",
    "description": "Memory usage exceeded 85% threshold",
    "severity": "P2",
    "incident_type": "Infrastructure"
  }'

# Get incident
curl http://localhost:3000/api/v1/incidents/{incident_id}

# Acknowledge incident
curl -X POST http://localhost:3000/api/v1/incidents/{incident_id}/acknowledge \
  -H "Content-Type: application/json" \
  -d '{"actor": "user@example.com"}'

# Resolve incident
curl -X POST http://localhost:3000/api/v1/incidents/{incident_id}/resolve \
  -H "Content-Type: application/json" \
  -d '{
    "resolved_by": "user@example.com",
    "method": "Manual",
    "notes": "Restarted service",
    "root_cause": "Memory leak in application"
  }'

gRPC API

service IncidentService {
  rpc CreateIncident(CreateIncidentRequest) returns (CreateIncidentResponse);
  rpc GetIncident(GetIncidentRequest) returns (Incident);
  rpc UpdateIncident(UpdateIncidentRequest) returns (Incident);
  rpc StreamIncidents(StreamIncidentsRequest) returns (stream Incident);
  rpc AnalyzeCorrelations(AnalyzeCorrelationsRequest) returns (CorrelationResult);
}

GraphQL API

The GraphQL API provides a flexible, type-safe interface with real-time subscriptions:

# Query incidents with advanced filtering
query GetIncidents {
  incidents(
    first: 20
    filter: {
      severity: [P0, P1]
      status: [NEW, ACKNOWLEDGED]
      environment: [PRODUCTION]
    }
    orderBy: { field: CREATED_AT, direction: DESC }
  ) {
    edges {
      node {
        id
        title
        severity
        status
        assignedTo {
          name
          email
        }
        sla {
          resolutionDeadline
          resolutionBreached
        }
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}

# Subscribe to real-time incident updates
subscription IncidentUpdates {
  incidentUpdated(filter: { severity: [P0, P1] }) {
    incident {
      id
      title
      status
    }
    updateType
    changedFields
  }
}

GraphQL Endpoints:

  • Query/Mutation: POST http://localhost:8080/graphql
  • Subscriptions: WS ws://localhost:8080/graphql
  • Playground: GET http://localhost:8080/graphql/playground

Documentation:

Feature Guides

1. Escalation Engine

Create escalation policies and automatically escalate incidents based on time and severity:

use llm_incident_manager::escalation::{
    EscalationPolicy, EscalationLevel, EscalationTarget, TargetType,
};

// Define escalation policy
let policy = EscalationPolicy {
    name: "Critical Production Incidents".to_string(),
    levels: vec![
        EscalationLevel {
            level: 1,
            name: "L1 On-Call".to_string(),
            targets: vec![
                EscalationTarget {
                    target_type: TargetType::OnCall,
                    identifier: "platform-team".to_string(),
                }
            ],
            escalate_after_secs: 300,  // 5 minutes
            channels: vec!["pagerduty".to_string(), "slack".to_string()],
        },
        EscalationLevel {
            level: 2,
            name: "Engineering Lead".to_string(),
            targets: vec![
                EscalationTarget {
                    target_type: TargetType::User,
                    identifier: "eng-lead@example.com".to_string(),
                }
            ],
            escalate_after_secs: 900,  // 15 minutes
            channels: vec!["pagerduty".to_string(), "sms".to_string()],
        },
    ],
    // ... conditions
};

escalation_engine.register_policy(policy);

See ESCALATION_GUIDE.md for complete documentation.

2. Context Enrichment

Automatically enrich incidents with historical data, service information, and team context:

use llm_incident_manager::enrichment::{EnrichmentConfig, EnrichmentService};

let mut config = EnrichmentConfig::default();
config.enable_historical = true;
config.enable_service = true;
config.enable_team = true;
config.similarity_threshold = 0.5;

let service = EnrichmentService::new(config, store);
service.start().await?;

// Enrichment happens automatically in the processor
let context = service.enrich_incident(&incident).await?;

// Access enriched data
if let Some(historical) = context.historical {
    println!("Found {} similar incidents", historical.similar_incidents.len());
}

See ENRICHMENT_GUIDE.md for complete documentation.

3. Correlation Engine

Group related incidents to reduce alert fatigue:

use llm_incident_manager::correlation::{CorrelationEngine, CorrelationConfig};

let mut config = CorrelationConfig::default();
config.time_window_secs = 300;  // 5 minutes
config.enable_similarity = true;
config.enable_source = true;

let engine = CorrelationEngine::new(store, config);
let result = engine.analyze_incident(&incident).await?;

if result.has_correlations() {
    println!("Found {} related incidents", result.correlation_count());
}

See CORRELATION_GUIDE.md for complete documentation.

4. ML Classification

Automatically classify incident severity using machine learning:

use llm_incident_manager::ml::{MLService, MLConfig};

let config = MLConfig::default();
let service = MLService::new(config);
service.start().await?;

// Classification happens automatically
let prediction = service.predict_severity(&incident).await?;
println!("Predicted severity: {:?} (confidence: {:.2})",
    prediction.predicted_severity,
    prediction.confidence
);

// Train with feedback
service.add_training_sample(&incident).await?;
service.trigger_training().await?;

See ML_CLASSIFICATION_GUIDE.md for complete documentation.

5. Circuit Breakers

Protect your system from cascading failures with automatic circuit breaking:

use llm_incident_manager::circuit_breaker::CircuitBreaker;
use std::time::Duration;

// Create circuit breaker for external service
let circuit_breaker = CircuitBreaker::new("sentinel-api")
    .failure_threshold(5)       // Open after 5 failures
    .timeout(Duration::from_secs(60))  // Wait 60s before testing recovery
    .success_threshold(2)       // Close after 2 successful tests
    .build();

// Execute request through circuit breaker
let result = circuit_breaker.call(|| async {
    sentinel_client.fetch_alerts(Some(10)).await
}).await;

match result {
    Ok(alerts) => {
        println!("Fetched {} alerts", alerts.len());
    }
    Err(e) if e.is_circuit_open() => {
        println!("Circuit breaker is open, using fallback");
        // Use cached data or alternative service
        let fallback_data = cache.get_alerts()?;
        Ok(fallback_data)
    }
    Err(e) => {
        println!("Request failed: {}", e);
        Err(e)
    }
}

Key Features

  1. Three States:

    • Closed: Normal operation, requests flow through
    • Open: Service failing, requests fail immediately (< 1ms)
    • Half-Open: Testing recovery with limited requests
  2. Automatic Recovery:

    • Configurable timeout before recovery testing
    • Multiple recovery strategies (fixed, linear, exponential backoff)
    • Gradual traffic restoration
  3. Comprehensive Monitoring:

// Check circuit breaker state
let state = circuit_breaker.state().await;
println!("Circuit state: {:?}", state);

// Get detailed information
let info = circuit_breaker.info().await;
println!("Error rate: {:.2}%", info.error_rate * 100.0);
println!("Total requests: {}", info.total_requests);
println!("Failures: {}", info.failure_count);

// Health check
let health = circuit_breaker.health_check().await;
  1. Manual Control (for operations):
# Force open (maintenance mode)
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/open

# Force close (after maintenance)
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/close

# Reset circuit breaker
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/reset

# Get status
curl http://localhost:8080/v1/circuit-breakers/sentinel
  1. Configuration Example:
# config/circuit_breakers.yaml
circuit_breakers:
  sentinel:
    name: "sentinel-api"
    failure_threshold: 5
    success_threshold: 2
    timeout_secs: 60
    volume_threshold: 10
    recovery_strategy:
      type: "exponential_backoff"
      initial_timeout_secs: 60
      max_timeout_secs: 300
      multiplier: 2.0
  1. Prometheus Metrics:
circuit_breaker_state{name="sentinel"} 0           # 0=closed, 1=open, 2=half-open
circuit_breaker_requests_total{name="sentinel"}
circuit_breaker_requests_failed{name="sentinel"}
circuit_breaker_error_rate{name="sentinel"}
circuit_breaker_open_count{name="sentinel"}

See CIRCUIT_BREAKER_GUIDE.md for complete documentation.

Testing

Run All Tests

# Unit tests
cargo test --lib

# Integration tests
cargo test --test '*'

# All tests with coverage
cargo tarpaulin --all-features --workspace --timeout 120

Test Coverage

  • Unit Tests: 48 tests across all modules
  • Integration Tests: 75+ tests covering end-to-end workflows
  • Total Coverage: ~85%

Performance

Benchmarks

Operation Latency (p95) Throughput
Alert Processing < 50ms 10,000/sec
Incident Creation < 100ms 5,000/sec
ML Classification < 30ms 15,000/sec
Enrichment (cached) < 5ms 50,000/sec
Enrichment (uncached) < 150ms 3,000/sec
Correlation Analysis < 80ms 8,000/sec

Resource Requirements

Component CPU Memory Notes
Core Processor 2 cores 512MB Base requirements
ML Service 2 cores 1GB With models loaded
Enrichment Service 1 core 256MB With caching
PostgreSQL 4 cores 4GB For production

Documentation

Implementation Guides

API Documentation

Project Structure

llm-incident-manager/
├── src/
│   ├── api/              # REST/gRPC/GraphQL APIs
│   ├── config/           # Configuration management
│   ├── correlation/      # Correlation engine
│   ├── enrichment/       # Context enrichment
│   │   ├── enrichers.rs  # Enricher implementations
│   │   ├── models.rs     # Data structures
│   │   ├── pipeline.rs   # Enrichment orchestration
│   │   └── service.rs    # Service management
│   ├── error/            # Error types
│   ├── escalation/       # Escalation engine
│   ├── grpc/             # gRPC service implementations
│   ├── integrations/     # LLM integrations (NEW)
│   │   ├── common/       # Shared utilities (client trait, retry, auth)
│   │   ├── sentinel/     # Sentinel monitoring client
│   │   ├── shield/       # Shield security client
│   │   ├── edge_agent/   # Edge-Agent distributed client
│   │   └── governance/   # Governance compliance client
│   ├── ml/               # ML classification
│   │   ├── classifier.rs # Classification logic
│   │   ├── features.rs   # Feature extraction
│   │   ├── models.rs     # Data structures
│   │   └── service.rs    # Service management
│   ├── models/           # Core data models
│   ├── notifications/    # Notification service
│   ├── playbooks/        # Playbook automation
│   ├── processing/       # Incident processor
│   └── state/            # Storage implementations
├── tests/                # Integration tests
│   ├── integration_sentinel_test.rs     # Sentinel client tests
│   ├── integration_shield_test.rs       # Shield client tests
│   ├── integration_edge_agent_test.rs   # Edge-Agent client tests
│   └── integration_governance_test.rs   # Governance client tests
├── proto/                # Protocol buffer definitions
├── migrations/           # Database migrations
└── docs/                 # Additional documentation
    ├── LLM_CLIENT_README.md                 # LLM integrations overview
    ├── LLM_CLIENT_ARCHITECTURE.md           # Detailed architecture
    ├── LLM_CLIENT_IMPLEMENTATION_GUIDE.md   # Implementation guide
    ├── LLM_CLIENT_QUICK_REFERENCE.md        # Quick reference
    └── llm-client-types.ts                  # TypeScript type definitions

Development

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Code Style

# Format code
cargo fmt

# Lint
cargo clippy --all-features

# Check
cargo check --all-features

Running Locally

# Development mode with hot reload
cargo watch -x run

# With debug logging
RUST_LOG=debug cargo run

# With specific features
cargo run --features "postgresql,redis"

Deployment

Docker

FROM rust:1.70 as builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/llm-incident-manager /usr/local/bin/
CMD ["llm-incident-manager"]

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: incident-manager
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: incident-manager
        image: llm-incident-manager:latest
        ports:
        - containerPort: 3000
        - containerPort: 50051
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: incident-manager-secrets
              key: database-url

Monitoring

Metrics (Prometheus)

The system exposes comprehensive metrics on port 9090 (configurable via LLM_IM__SERVER__METRICS_PORT).

Integration Metrics (per LLM integration):

llm_integration_requests_total{integration="sentinel|shield|edge-agent|governance"}
llm_integration_requests_successful{integration="..."}
llm_integration_requests_failed{integration="..."}
llm_integration_success_rate_percent{integration="..."}
llm_integration_latency_milliseconds_average{integration="..."}
llm_integration_last_request_timestamp{integration="..."}

Core System Metrics:

incident_manager_alerts_processed_total
incident_manager_incidents_created_total
incident_manager_incidents_resolved_total
incident_manager_escalations_triggered_total
incident_manager_enrichment_duration_seconds
incident_manager_enrichment_cache_hit_rate
incident_manager_correlation_groups_created_total
incident_manager_ml_predictions_total
incident_manager_ml_prediction_confidence
incident_manager_notifications_sent_total
incident_manager_processing_duration_seconds

Quick Access:

# Prometheus format
curl http://localhost:9090/metrics

# JSON format
curl http://localhost:8080/v1/metrics/integrations

For complete metrics documentation, dashboards, and alerting:

Health Checks

# Liveness probe
curl http://localhost:8080/health/live

# Readiness probe
curl http://localhost:8080/health/ready

# Full health status with metrics
curl http://localhost:8080/health

Security

Authentication

  • API Key authentication
  • mTLS for gRPC
  • JWT tokens for WebSocket

Data Protection

  • Encrypted at rest (PostgreSQL encryption)
  • TLS 1.3 in transit
  • Sensitive data redaction in logs

Vulnerability Reporting

Please report security issues to: security@example.com

License

This project is licensed under the MIT License - see the LICENSE file for details.

Built With

  • Rust - Systems programming language
  • Tokio - Async runtime
  • PostgreSQL - Primary database
  • SQLx - SQL toolkit
  • Tonic - gRPC implementation
  • Axum - Web framework
  • Serde - Serialization framework
  • SmartCore - Machine learning library
  • Tracing - Structured logging

Acknowledgments

Designed and implemented for enterprise-grade LLM infrastructure management with a focus on reliability, performance, and extensibility.


Status: Production Ready | Version: 1.0.0 | Language: Rust | Last Updated: 2025-11-12


Recent Updates

2025-11-12: LLM Integrations Module ✅

  • Implemented enterprise-grade LLM client integrations for Sentinel, Shield, Edge-Agent, and Governance
  • 5,913 lines of production Rust code with comprehensive error handling
  • 1,578 lines of integration tests (78 test cases)
  • Multi-framework compliance support (GDPR, HIPAA, SOC2, PCI, ISO27001)
  • gRPC bidirectional streaming for Edge-Agent
  • Exponential backoff retry logic with jitter
  • Complete documentation suite in /docs
Commit count: 0

cargo fmt