| Crates.io | pkboost |
| lib.rs | pkboost |
| version | 2.2.0 |
| created_at | 2025-12-29 06:45:22.549715+00 |
| updated_at | 2026-01-21 11:28:14.5869+00 |
| description | Shannon-guided gradient boosting for extreme class imbalance with adaptive drift detection. Outperforms XGBoost/LightGBM on imbalanced data. |
| homepage | https://github.com/Pushp-Kharat1/pkboost |
| repository | https://github.com/Pushp-Kharat1/pkboost |
| max_upload_size | |
| id | 2010090 |
| size | 1,606,520 |
Built from scratch in Rust, PKBoost (Performance-Based Knowledge Booster) handles changing data distributions in fraud detection at a 0.2% fraud rate, degrading by less than 2% under drift, compared with a 31.8% drop for XGBoost and a 42.5% drop for LightGBM. On the standard (no-drift) benchmark it outperforms XGBoost and LightGBM by roughly 10-18% PR-AUC. It combines Shannon-entropy information gain with Newton-Raphson boosting to detect shifts in rare-event behavior and trigger an adaptive "metamorphosis" for real-time recovery.
"Most boosting libraries overlook concept drift. PKBoost identifies it and evolves to persist."
Perfect for: Multi-class fraud detection, real-time medical diagnosis, anomaly detection in changing environments, or any scenario where data evolves over time and minority classes are critical.
See CHANGELOG_V2.md for full details.
Clone the repository and build:
git clone https://github.com/Pushp-Kharat1/pkboost.git
cd pkboost
cargo build --release
Run the benchmark:
Place the benchmark CSVs in the data/ directory, then verify:
ls data/ # Should show creditcard_train.csv, creditcard_val.csv, etc.
cargo run --release --bin benchmark
To train and predict (see src/bin/benchmark.rs for a full example):
use pkboost::*;
use csv;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Load CSVs with headers: feature1,feature2,...,Class
    let (x_train, y_train) = load_csv("train.csv")?;
    let (x_val, y_val) = load_csv("val.csv")?;
    let (x_test, y_test) = load_csv("test.csv")?;

    // Auto-configure based on data characteristics
    let mut model = OptimizedPKBoostShannon::auto(&x_train, &y_train);

    // Train with early stopping on the validation set
    model.fit(
        &x_train,
        &y_train,
        Some((&x_val, &y_val)), // Optional validation set
        true,                   // Verbose output
    )?;

    // Predict probabilities (not classes)
    let test_probs = model.predict_proba(&x_test)?;

    // Evaluate
    let pr_auc = calculate_pr_auc(&y_test, &test_probs);
    println!("PR-AUC: {:.4}", pr_auc);

    Ok(())
}

// Helper function (put in your own code); requires the csv crate
fn load_csv(path: &str) -> Result<(Vec<Vec<f64>>, Vec<f64>), Box<dyn Error>> {
    let mut reader = csv::Reader::from_path(path)?;
    let headers = reader.headers()?.clone();
    let target_col_index = headers
        .iter()
        .position(|h| h == "Class")
        .ok_or("Class column not found")?;

    let mut features = Vec::new();
    let mut labels = Vec::new();
    for result in reader.records() {
        let record = result?;
        let mut row: Vec<f64> = Vec::new();
        for (i, value) in record.iter().enumerate() {
            if i == target_col_index {
                labels.push(value.parse()?);
            } else {
                // Empty cells become NaN (treated as missing values)
                let parsed_value = if value.is_empty() {
                    f64::NAN
                } else {
                    value.parse()?
                };
                row.push(parsed_value);
            }
        }
        features.push(row);
    }
    Ok((features, labels))
}
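The example above assumes pkboost and the csv crate are declared in Cargo.toml. A minimal manifest sketch (the csv version is an assumption; pin whichever 1.x release you use):

```toml
[dependencies]
pkboost = "2.2"
csv = "1"
```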
Expected CSV format:
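A minimal illustration (the feature column names are placeholders; load_csv above only assumes a numeric target column named Class, and empty cells are read as missing values):

```csv
feature1,feature2,feature3,Class
0.12,3.40,,0
1.73,2.10,0.95,1
```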
More end-to-end examples live in the src/bin/*.rs files, e.g. benchmark.rs; CSV loading uses the csv crate.

Regression usage:
use pkboost::*;

// x_train, y_train, etc. are loaded the same way as in the classification example.
let mut model = PKBoostRegressor::auto(&x_train, &y_train);
model.fit(&x_train, &y_train, Some((&x_val, &y_val)), true)?;

let predictions = model.predict(&x_test)?;
let rmse = calculate_rmse(&y_test, &predictions);
let r2 = calculate_r2(&y_test, &predictions);
println!("RMSE: {:.4}, R²: {:.4}", rmse, r2);
Multi-class usage:
use pkboost::MultiClassPKBoost;

// y_train contains class labels: 0.0, 1.0, 2.0, ...
let mut model = MultiClassPKBoost::new(3); // 3 classes
model.fit(&x_train, &y_train, None, true)?;

let probs = model.predict_proba(&x_test)?; // [n_samples, n_classes]
let predictions = model.predict(&x_test)?; // class indices

let accuracy = predictions
    .iter()
    .zip(y_test.iter())
    .filter(|(&pred, &true_y)| pred == true_y as usize)
    .count() as f64
    / y_test.len() as f64;
println!("Accuracy: {:.2}%", accuracy * 100.0);
- Extreme Imbalance Handling: automatic class weighting and MI regularization boost recall on rare positives without reducing precision (binary classification only).
- Adaptive Hyperparameters: auto_tune_principled profiles your dataset for suitable parameters; no manual tuning needed.
- Histogram-Based Trees: optimized binning with median imputation for missing values; up to 32 bins per feature for fast splits.
- Parallelism & Efficiency: Rayon-based adaptive parallelism detects the hardware and scales thresholds dynamically; large datasets are processed in efficient batches.
- Adaptation Mechanisms: AdversarialLivingBooster monitors vulnerability scores to detect drift and trigger retraining, including pruning unused features through "metabolism" tracking.
- Metrics Built-In: PR-AUC, ROC-AUC, F1@0.5, and threshold optimization are available out of the box (a threshold-tuning sketch follows this list).
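The metric and threshold utilities ship with the crate; purely as an illustration of the idea, the hypothetical helper below brute-forces an F1-maximizing threshold over the documented predict_proba output using plain Rust (nothing here is PKBoost's internal API):

```rust
/// Hypothetical helper (not part of PKBoost): scan candidate thresholds and
/// return the one that maximizes F1 on held-out labels and predicted probabilities.
fn best_f1_threshold(y_true: &[f64], probs: &[f64]) -> (f64, f64) {
    let mut best = (0.5, 0.0); // (threshold, f1)
    for step in 1..100 {
        let t = step as f64 / 100.0;
        let (mut tp, mut fp, mut fn_) = (0.0, 0.0, 0.0);
        for (&y, &p) in y_true.iter().zip(probs) {
            let positive = p >= t;
            if positive && y == 1.0 { tp += 1.0; }
            if positive && y == 0.0 { fp += 1.0; }
            if !positive && y == 1.0 { fn_ += 1.0; }
        }
        // F1 = 2TP / (2TP + FP + FN); guard against the all-negative case.
        let f1 = if tp == 0.0 { 0.0 } else { 2.0 * tp / (2.0 * tp + fp + fn_) };
        if f1 > best.1 { best = (t, f1); }
    }
    best
}
```

On extremely imbalanced data the F1-optimal threshold usually sits well below 0.5, which is why working from raw probabilities matters.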
For full mathematical derivations, refer to Math.pdf.
Testing methodology: All models use default settings with no hyperparameter tuning. This reflects real-world usage where most practitioners cannot dedicate time to extensive tuning.
PKBoost's auto-tuning provides an edge here: it automatically detects the imbalance and adjusts its parameters. LightGBM and XGBoost can match these results with tuning, but that requires expert knowledge.
Reproducibility: All benchmark code is in src/bin/benchmark.rs. Data splits: 60% train, 20% val, 20% test. LGBM/XGB used default params from their Rust crates. Full benchmarks (10+ datasets): See BENCHMARKS.md.
| Dataset | Samples | Imbalance | Model | PR-AUC | F1@0.5 | ROC-AUC |
|---|---|---|---|---|---|---|
| Credit Card | 170,884 | 0.2% (extreme) | PKBoost | 87.8% | 87.4% | 97.5% |
| | | | LightGBM | 79.3% | 71.3% | 92.1% |
| | | | XGBoost | 74.5% | 79.8% | 91.7% |
| | | | Improvement vs LGBM | +10.4% | +22.7% | +5.7% |
| | | | Improvement vs XGBoost | +17.9% | +9.7% | +6.1% |
| Pima Diabetes | 460 | 35.0% (balanced) | PKBoost | 98.0% | 93.7% | 98.6% |
| | | | LightGBM | 62.9% | 48.8% | 82.4% |
| | | | XGBoost | 68.0% | 60.0% | 82.0% |
| | | | Improvement vs LGBM | +55.7% | +92.0% | +19.6% |
| | | | Improvement vs XGBoost | +44.0% | +56.1% | +20.1% |
| Breast Cancer | 341 | 37.2% (balanced) | PKBoost | 97.9% | 93.2% | 98.6% |
| | | | LightGBM | 99.1% | 96.3% | 99.2% |
| | | | XGBoost | 99.2% | 95.1% | 99.4% |
| | | | Improvement vs LGBM | -1.2% | -3.3% | -0.7% |
| | | | Improvement vs XGBoost | -1.4% | -2.1% | -0.8% |
| Heart Disease | 181 | 45.9% (balanced) | PKBoost | 87.8% | 82.5% | 88.5% |
| Ionosphere | 210 | 35.7% (balanced) | PKBoost | 98.0% | 93.7% | 98.5% |
| | | | LightGBM | 95.4% | 88.9% | 96.0% |
| | | | XGBoost | 97.2% | 88.9% | 97.5% |
| | | | Improvement vs LGBM | +2.7% | +5.4% | +2.7% |
| | | | Improvement vs XGBoost | +0.8% | +5.4% | +1.1% |
| Sonar | 124 | 46.8% (balanced) | PKBoost | 91.8% | 87.2% | 93.6% |
| SpamBase | 2,760 | 39.4% (balanced) | PKBoost | 98.0% | 93.3% | 98.0% |
| Adult | - | 24.1% (balanced) | PKBoost | 81.2% | 71.9% | 92.0% |
Multi-class benchmark:

| Dataset | Classes | Imbalance | Model | Accuracy | Macro-F1 | Time (s) |
|---|---|---|---|---|---|---|
| Synthetic-5 | 5 | 16.7:1 (50%/3%) | PKBoost | 100.0% | 1.0000 | 3.43 |
| | | | LightGBM | 71.8% | 0.5835 | 0.87 |
| | | | XGBoost | 70.7% | 0.5568 | 1.57 |
| | | | Improvement vs LGBM | +39.3% | +71.4% | -3.9x |
| | | | Improvement vs XGBoost | +41.4% | +79.6% | -2.2x |
Notes: PR-AUC is prioritized for imbalanced data; F1@0.5 uses the optimal threshold. Unfilled cells indicate benchmarks in progress.

Note on Pima Diabetes: small datasets (n=460) have high variance due to limited samples; results may not generalize, so re-run on your own data to confirm.

Note on Breast Cancer: PKBoost slightly underperforms on nearly balanced datasets (37% minority). This is expected, since the optimizations target extreme imbalance; for balanced data, use XGBoost.
Credit Card Fraud (0.2% minority class) illustrates the pattern: as imbalance severity increases (from balanced to 5% to 1% to 0.2%), traditional boosting performance drops steadily while PKBoost maintains high accuracy.
PKBoost features experimental drift detection that monitors model vulnerabilities and can trigger adaptive retraining.
Benchmark: After introducing a significant covariate shift (adding noise to 10 features), models were tested on corrupted data:
| Model | Baseline PR-AUC | PR-AUC After Drift | Relative Degradation |
|---|---|---|---|
| PKBoost | 87.8% | 86.2% | 1.8% |
| LightGBM | 79.3% | 45.6% | 42.5% |
| XGBoost | 74.5% | 50.8% | 31.8% |
PKBoost's robustness under drift comes from its base architecture rather than from adaptive retraining.

Note: adaptive retraining is experimental and did not trigger in this test.
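For intuition about the benchmark setup (noise added to a subset of features), a covariate shift of this kind can be simulated as below. This is an illustrative sketch assuming the rand crate, not the code in test_drift.rs:

```rust
use rand::Rng;

// Illustrative covariate-shift simulation: add uniform noise to the chosen
// feature columns of a Vec<Vec<f64>> feature matrix (the format used above).
fn corrupt_features(x: &mut [Vec<f64>], noisy_cols: &[usize], scale: f64) {
    let mut rng = rand::thread_rng();
    for row in x.iter_mut() {
        for &col in noisy_cols {
            if let Some(v) = row.get_mut(col) {
                if v.is_finite() {
                    *v += scale * rng.gen_range(-1.0..1.0);
                }
            }
        }
    }
}
```

Evaluating an already-trained model on such a corrupted copy of the test set is the kind of comparison reported in the table above.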
Traditional gradient boosting struggles with extreme imbalance because its split criterion is driven by gradients alone, which the majority class dominates, so rare positives contribute almost nothing to how trees are grown.
PKBoost's approach: weight the classes automatically and add a Shannon-entropy information term to the split criterion, so splits that isolate minority-class structure are explicitly rewarded.
Technical innovation: fusing information theory with Newton boosting. Each split maximizes

Gain = GradientGain + λ * InformationGain

where λ adapts to the imbalance severity.
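Purely as an illustration (the function and the λ schedule below are assumptions, not PKBoost's internal code), the combined criterion can be sketched as:

```rust
// Hypothetical sketch of the combined split criterion: Newton-style gradient gain
// plus a Shannon information-gain term, weighted by an imbalance-dependent lambda.
// The lambda schedule is an illustrative assumption, not the library's formula.
fn combined_split_gain(gradient_gain: f64, information_gain: f64, minority_fraction: f64) -> f64 {
    // Rarer minority class => larger lambda => the information term matters more.
    let lambda = (1.0 / minority_fraction.max(1e-6)).ln().max(0.0);
    gradient_gain + lambda * information_gain
}
```

Under this illustrative schedule, a 0.2% minority class gives λ ≈ ln(500) ≈ 6.2, while balanced data gives λ ≈ ln(2) ≈ 0.7, so the entropy term only dominates when imbalance is severe.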
[Your Data] → [Auto-Tuner: detects imbalance] → [Shannon-Guided Trees: entropy + gradient split criterion] → [Predictions: PR-AUC optimized]
- Core Model: OptimizedPKBoostShannon, Shannon-entropy regularized trees with MI weighting.
- Data Prep: OptimizedHistogramBuilder, fast binning, median imputation, parallel transforms.
- Tuning: auto_tune_principled and auto_params, dataset-aware hyperparameters.
- Adaptation: AdversarialLivingBooster, which monitors drift through vulnerability scores and triggers retraining, including feature pruning via metabolism tracking.
- Parallelism: adaptive_parallel, hardware-aware Rayon configuration with core and RAM detection (see the sketch after this list).
- Evaluation: built-in calculations for PR-AUC, ROC-AUC, and F1.
- Drift Sims: scripts such as test_drift.rs and test_static.rs for baseline comparisons.
See src/ for full implementation. Binary classification only.
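The parallelism layer is internal to the crate, but the general pattern of sizing a Rayon pool to the detected hardware looks roughly like the snippet below. It only illustrates that pattern (assuming rayon as a dependency); it is not adaptive_parallel's actual logic, which per the list above also accounts for RAM:

```rust
// Illustration only: build a global Rayon thread pool sized to the detected cores.
// PKBoost configures parallelism internally, so user code normally need not do this.
fn configure_threads() -> Result<(), rayon::ThreadPoolBuildError> {
    let cores = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    rayon::ThreadPoolBuilder::new()
        .num_threads(cores)
        .build_global()
}
```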
Benchmark: Credit Card Fraud (~57K samples, 0.17% fraud rate)
| Model | PR-AUC | ROC-AUC | F1 | Precision | Train Time |
|---|---|---|---|---|---|
| PKBoost | 84.6% | 95.2% | 86.5% | 94.1% | ~1.7s |
| LightGBM | 83.7% | 94.9% | 76.2% | 72.7% | ~0.6s |
| XGBoost | 80.4% | 93.6% | 76.9% | 78.9% | ~1.0s |
Python Package:
pip install pkboost
See Python Bindings Guide for full API documentation.
Install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Clone & build: As above.
Run:
cargo run --release --bin benchmark # uses data/*.csv
Drift tests:
cargo run --bin test_drift
Datasets are sourced from the UCI Machine Learning Repository.
"error: linker cc not found"
sudo apt install build-essentialOut of memory during compilation:
cargo build --release --jobs 1 # Limit parallel compilation
Slow training on large datasets:
--release flagOpen for contributions! Fork & PR: Focus on extensions, optimizations, or new tests. Issues welcome for bugs or dataset requests.
Contact: kharatpushp16@outlook.com
PKBoost is dual-licensed; you may choose either license when using this software (see the license files in the repository).
If you use PKBoost in your research, please cite:
@software{kharat2025pkboost,
  author = {Kharat, Pushp},
  title = {PKBoost: Shannon-Guided Gradient Boosting for Extreme Imbalance},
  year = {2025},
  url = {https://github.com/Pushp-Kharat1/pkboost}
}
Questions? Open an issue.
Library by Pushp Kharat. Last updated: December 27, 2025.