GBRT-RS: Gradient Boosted Regression Trees in Rust


A high-performance, production-ready implementation of Gradient Boosted Regression Trees (GBRT) in Rust, engineered for speed, memory safety, and reproducible machine learning workflows. The library offers both a comprehensive CLI tool and a programmatic API for regression and binary classification, with categorical encoding that captures metadata automatically, patience-based early stopping, k-fold cross-validation with proper metric calculation, feature importance analysis, and robust model serialization.

Features

  • High Performance: Written in Rust for speed and memory safety
  • Dual Interface: Both programmatic API and comprehensive CLI
  • Multiple Objectives: Regression (MSE, MAE, Huber) and Binary Classification (LogLoss)
  • Categorical Features: First-class support with encoding and auto-detection
  • Early Stopping: Validation-based early stopping with configurable patience
  • Cross-Validation: Built-in k-fold CV with proper metrics
  • Feature Importance: Analyze which features drive predictions
  • Serialization: Save/load trained models to/from JSON
  • Data Validation: Automatic dataset validation and error handling
  • Hyperparameter Tuning: Full control over learning rate, depth, subsampling

Installation

Install from Crates.io

cargo install gbrt-rs

Build from Source

git clone https://github.com/Guap-Codes/gbrt-rs.git

cd gbrt-rs

cargo build --release

The optimized binary will be available at target/release/gbrt-rs.

Quick Start

Command Line Interface (Recommended for Data Scientists)

Train a model on housing data:

# Train with validation and categorical features
./gbrt-rs train \
  --data housing.csv \
  --target price \
  --categorical-columns neighborhood,zip_code \
  --validation validation.csv \
  --early-stopping-rounds 15 \
  --output home_model \
  --verbose

# Cross-validate to assess generalization
./gbrt-rs cross-validate \
  --data housing.csv \
  --target price \
  --categorical-columns neighborhood,zip_code \
  --folds 5

# Evaluate on test set
./gbrt-rs evaluate \
  --model model.json \
  --data test.csv \
  --target price

# Make predictions
./gbrt-rs predict \
  --model model.json \
  --data new_data.csv \
  --output predictions.csv

# Analyze feature importance
./gbrt-rs feature-importance \
  --model model.json \
  --top-k 10 \
  --output importance.json

Programmatic API (For Rust Developers)

use gbrt_rs::{GBRTModel, Dataset, DataLoader, ModelIO, ModelMetrics, Result};

fn main() -> Result<()> {
    // Load data with categorical encoding
    let data_loader = DataLoader::new()?;
    let categorical_cols = vec![4, 5]; // indices for neighborhood, zip_code
    let dataset = data_loader.load_csv_with_categorical(
        "housing.csv",
        "price",
        Some(&categorical_cols),
        0.1
    )?;
    
    // Train model
    let mut model = GBRTModel::new()?;
    model.set_feature_names(vec![
        "square_feet".to_string(),
        "bedrooms".to_string(),
        "neighborhood".to_string(),
        "zip_code".to_string(),
    ]);
    model.fit(&dataset)?;
    
    // Evaluate
    let predictions = model.predict(dataset.features())?;
    let metrics = ModelMetrics::regression_metrics(
        dataset.targets().as_slice().unwrap(),
        &predictions
    )?;
    println!("R²: {:.4}", metrics.r2);
    
    // Save model
    let model_io = ModelIO::new()?;
    model_io.save_model(&model.booster(), "model.json", "housing_model")?;
    
    Ok(())
}

CLI Reference

Core Commands

train - Train a new model

./gbrt-rs train [OPTIONS] --data <FILE> --target <COLUMN>

Required:
  -d, --data <FILE>         Training data (CSV)
  -t, --target <COLUMN>     Target column name

Options:
  -o, --output <FILE>              Save trained model
  -m, --model-type <TYPE>          regression|classification [default: regression]
      --n-estimators <N>           Boosting rounds [default: 100]
      --learning-rate <RATE>       Step size [default: 0.1]
      --max-depth <DEPTH>          Tree depth [default: 6]
      --subsample <RATIO>          Row sampling [default: 1.0]
  -V, --validation <FILE>          Separate validation set
      --test-split <RATIO>         Internal split ratio [default: 0.2]
      --early-stopping-rounds <N>  Patience in rounds
      --early-stopping-tolerance <TOL>  Minimum improvement [default: 0.001]
      --categorical-columns <COLS> Comma-separated column names
      --categorical-threshold <T>  Auto-detect ratio [default: 0.1]
      --feature-names <NAMES>      Override feature names
  -v, --verbose                    Show training progress
  -h, --help                       Print help

cross-validate - K-fold cross-validation

./gbrt-rs cross-validate [OPTIONS] --data <FILE> --target <COLUMN>

Options:
  -f, --folds <N>                Number of folds [default: 5]
      --categorical-columns <COLS>
      --categorical-threshold <T>
  -v, --verbose

Outputs per-fold and mean R² (or accuracy for classification).

Analysis Commands

evaluate - Compute metrics on test data

./gbrt-rs evaluate --model <MODEL> --data <FILE> --target <COLUMN> [--verbose]

Outputs RMSE, MAE, R², and sample count.
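
For reference, the reported metrics follow the standard definitions. The sketch below computes them directly from targets and predictions; it is independent of the library's ModelMetrics type and is shown only to make the numbers easy to interpret.

// Standard definitions of the metrics reported by `evaluate`
// (illustrative; the CLI computes these internally).
fn rmse(y: &[f64], yhat: &[f64]) -> f64 {
    let mse = y.iter().zip(yhat).map(|(a, b)| (a - b).powi(2)).sum::<f64>() / y.len() as f64;
    mse.sqrt()
}

fn mae(y: &[f64], yhat: &[f64]) -> f64 {
    y.iter().zip(yhat).map(|(a, b)| (a - b).abs()).sum::<f64>() / y.len() as f64
}

fn r2(y: &[f64], yhat: &[f64]) -> f64 {
    let mean = y.iter().sum::<f64>() / y.len() as f64;
    let ss_res: f64 = y.iter().zip(yhat).map(|(a, b)| (a - b).powi(2)).sum();
    let ss_tot: f64 = y.iter().map(|a| (a - mean).powi(2)).sum();
    1.0 - ss_res / ss_tot
}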

predict - Generate predictions

./gbrt-rs predict --model <MODEL> --data <FILE> [--output <FILE>] [--format csv|json]

Writes predictions to CSV/JSON or stdout.

feature-importance - Analyze model

./gbrt-rs feature-importance --model <MODEL> [--top-k <N>] [--output <FILE>]

Shows top features by importance score.

info - Model metadata

./gbrt-rs info --model <MODEL> [--verbose]

Displays model type, features, trees, parameters, and training state.

Library API Reference

Core Types

  • GBRTModel: Main model struct

    • new(): Regression model
    • new_classifier(): Classification model
    • with_config(config): Custom configuration
    • fit(&dataset): Train on data
    • fit_with_validation(&train, &val): Train with early stopping
    • predict(&features) -> Vec<f64>: Predictions
    • feature_importance() -> Vec<f64>: Importance scores
    • training_history() -> Option<&TrainingState>: Training metrics
  • Dataset: Data container

    • features() -> &FeatureMatrix
    • targets() -> &Array
    • n_samples(), n_features()
    • validate(): Check data integrity
  • DataLoader: CSV loading with encoding

    • load_csv(path): Load basic CSV
    • load_csv_with_categorical(path, target, cat_indices, threshold): With encoding
    • cross_validation_splits(dataset, k): Generate CV folds
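
A minimal sketch tying these types together, training with a held-out validation set and inspecting the result (file names, target column, and categorical indices are illustrative):

use gbrt_rs::{GBRTModel, DataLoader, Result};

fn main() -> Result<()> {
    // Load train/validation splits with the same categorical column indices
    let loader = DataLoader::new()?;
    let cat_cols = vec![2, 3];
    let train = loader.load_csv_with_categorical("train.csv", "price", Some(&cat_cols), 0.1)?;
    let valid = loader.load_csv_with_categorical("valid.csv", "price", Some(&cat_cols), 0.1)?;

    // Train with validation-based early stopping
    let mut model = GBRTModel::new()?;
    model.fit_with_validation(&train, &valid)?;

    // Per-feature importance scores (one per feature)
    let importance = model.feature_importance();
    println!("importance scores: {:?}", importance);

    // Training metrics are recorded during fitting
    if model.training_history().is_some() {
        println!("training history recorded");
    }
    Ok(())
}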

Configuration

use gbrt_rs::core::{GBRTConfig, TreeConfig, LossFunction};

let config = GBRTConfig {
    n_estimators: 200,
    learning_rate: 0.1,
    subsample: 0.8,
    loss: LossFunction::MSE,
    tree_config: TreeConfig {
        max_depth: 6,
        min_samples_split: 2,
        min_samples_leaf: 1,
        lambda: 1.0, // L2 regularization
        ..Default::default()
    },
    early_stopping_rounds: Some(15),
    early_stopping_tolerance: 0.001,
    ..Default::default()
};

let model = GBRTModel::with_config(config)?;

Data Format Requirements

CSV Specifications

  • Header row: Required with column names
  • Target column: Must be specified via --target
  • Numeric columns: Must contain valid numbers only
  • Categorical columns: String values, auto-detected or explicitly specified
  • Missing values: Not supported (will cause errors)
  • Encoding: UTF-8 required
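
For illustration, a valid input for --target price might look like the following (values are made up); square_feet and bedrooms are numeric, while neighborhood and zip_code would be specified as categorical:

square_feet,bedrooms,neighborhood,zip_code,price
1450,3,Downtown,90210,525000
2100,4,Hillside,90211,730000
980,2,Downtown,90210,410000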

Categorical Feature Best Practices

Explicit Specification (Recommended):

--categorical-columns neighborhood,zip_code,sector

Auto-Detection (Use with caution):

  • Threshold = unique_values / total_samples (default 0.1)
  • Columns with < 10% unique values become categorical
  • Risk: May misclassify discrete numeric features
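
As a worked example of the threshold rule above (made-up counts, not the library's detection code):

// Auto-detection rule: a column is treated as categorical when
// unique_values / total_samples < threshold (default 0.1).
fn is_categorical(unique_values: usize, total_samples: usize, threshold: f64) -> bool {
    (unique_values as f64) / (total_samples as f64) < threshold
}

fn main() {
    // 12 distinct zip codes in 1_000 rows -> 0.012 < 0.1 -> categorical
    assert!(is_categorical(12, 1_000, 0.1));
    // 950 distinct prices in 1_000 rows -> 0.95 >= 0.1 -> stays numeric
    assert!(!is_categorical(950, 1_000, 0.1));
}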

Why Explicit is Better:

  • Guarantees consistent encoding across train/test
  • Reproducible and self-documenting
  • Avoids threshold boundary issues

Architecture Overview

src/
├── lib.rs              # Public API & convenience functions
├── main.rs             # CLI entry point & command handlers
├── core/               # Core types and configurations
│   ├── GBRTConfig      # Main booster configuration
│   ├── TreeConfig      # Tree-specific parameters
│   └── LossFunction    # Objective enum
├── boosting/           # Gradient boosting engine
│   ├── GradientBooster # Core boosting algorithm
│   ├── BoosterFactory  # Factory methods
│   └── TrainingState   # Metrics tracking
├── tree/               # Decision tree components
│   ├── TreeBuilder     # Builder pattern implementation
│   ├── BestSplitter    # Split finding with binning
│   └── MSECriterion    # Split quality computation
├── data/               # Data structures and preprocessing
│   ├── Dataset         # Container for X and y
│   ├── FeatureMatrix   # Feature storage with encoding
│   └── preprocessing   # Categorical encoding logic
├── objective/          # Loss functions
│   ├── MSEObjective    # Regression
│   └── LogLossObjective # Classification
└── io/                  # I/O operations
    ├── DataLoader      # CSV reading with encoding
    ├── ModelIO         # JSON serialization
    └── SaveOptions     # Metadata storage

Key Design Decisions

  • Builder Pattern: TreeBuilder allows custom splitters/criteria
  • Trait Abstractions: Splitter, Criterion enable experimentation
  • Error Propagation: Result<T, GBRTError> for robust error handling
  • Zero-Copy: FeatureMatrix avoids unnecessary allocations
  • Type Safety: Strong typing prevents invalid configurations
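
To make the builder and trait abstractions concrete, here is a generic sketch of the pattern; the trait methods and builder signatures below are simplified stand-ins, not gbrt-rs's exact API:

// Simplified illustration of the TreeBuilder / Splitter / Criterion design;
// names and signatures are stand-ins, not the crate's real definitions.
trait Criterion {
    fn impurity(&self, targets: &[f64]) -> f64;
}

trait Splitter {
    // Returns (threshold, gain) for the best split of one feature, if any
    fn best_split(&self, feature: &[f64], targets: &[f64]) -> Option<(f64, f64)>;
}

struct TreeBuilder<S: Splitter, C: Criterion> {
    splitter: S,
    criterion: C,
    max_depth: usize,
}

impl<S: Splitter, C: Criterion> TreeBuilder<S, C> {
    fn new(splitter: S, criterion: C) -> Self {
        Self { splitter, criterion, max_depth: 6 }
    }

    // Builder-style setter: adjust depth without touching splitter/criterion
    fn with_max_depth(mut self, depth: usize) -> Self {
        self.max_depth = depth;
        self
    }
}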

Performance Guidelines

Training Speed Optimization

# Faster, less accurate
--n-estimators 100 --max-depth 4 --learning-rate 0.3

# Balanced (recommended)
--n-estimators 200 --max-depth 6 --learning-rate 0.1

# Slower, more accurate
--n-estimators 500 --max-depth 8 --learning-rate 0.05

Memory Usage

  • Memory usage grows exponentially with tree depth (a full tree of depth d has up to 2^(d+1) − 1 nodes: 127 at depth 6, 511 at depth 8)
  • Large categorical encodings (many categories) increase memory
  • Use --subsample 0.8 to reduce memory footprint

Prediction Latency

  • Single-threaded prediction (currently)
  • Linear in number of trees and depth
  • For real-time: consider model pruning or quantization
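
To see why latency scales linearly, note that a boosted-ensemble prediction visits every tree once and descends at most max_depth nodes per tree. The sketch below is illustrative only, not the crate's internal representation:

// Illustrative GBM prediction: cost is O(n_trees * max_depth) per sample.
struct Node {
    feature: usize,
    threshold: f64,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
    value: f64,
}

fn predict_one(trees: &[Node], learning_rate: f64, base: f64, x: &[f64]) -> f64 {
    let mut score = base;
    for tree in trees {
        let mut node = tree;
        // Descend until a leaf (a node with no children)
        while let (Some(left), Some(right)) = (node.left.as_deref(), node.right.as_deref()) {
            node = if x[node.feature] <= node.threshold { left } else { right };
        }
        score += learning_rate * node.value;
    }
    score
}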

Configuration Examples

Housing Price Regression

./gbrt-rs train \
  --data housing.csv \
  --target price \
  --categorical-columns neighborhood,zip_code \
  --n-estimators 300 \
  --max-depth 6 \
  --learning-rate 0.05 \
  --early-stopping-rounds 20 \
  --output housing_model.json

Stock Price Direction Classification

./gbrt-rs train \
  --data stocks.csv \
  --target direction \
  --model-type classification \
  --categorical-columns sector \
  --n-estimators 200 \
  --max-depth 5 \
  --subsample 0.8 \
  --output stock_classifier.json

Large Dataset with Sampling

./gbrt-rs train \
  --data large_dataset.csv \
  --target target \
  --n-estimators 1000 \
  --max-depth 4 \
  --learning-rate 0.01 \
  --subsample 0.5 \
  --early-stopping-rounds 50 \
  --output large_model.json

Troubleshooting & FAQ

Q: Why am I getting low R² scores?

A: Common causes:

  1. Categorical columns not specified → model sees raw strings as missing
  2. Too few trees (--n-estimators) or too shallow (--max-depth)
  3. Learning rate too high (try 0.05-0.1 for regression)
  4. Data leakage: features that include target information

Debug steps:

# Check model info
./gbrt-rs info --model model.json --verbose

# Ensure encoded features count matches expectations
# Should see: Features: 8 (original) + N (encoded categories)

# Run cross-validation to detect overfitting/underfitting
./gbrt-rs cross-validate --data train.csv --target target --folds 5

Q: Why does evaluation show catastrophic R² (negative millions)?

A: This is the categorical encoding mismatch bug. The evaluation data wasn't encoded the same way as training data. Update to latest version where this is fixed.

Q: What's the difference between --validation and --test-split?

A:

  • --validation: Uses a separate file you provide (recommended for reproducibility)
  • --test-split: Splits your training data internally (convenient but less control)

Q: How do I handle large datasets that don't fit in memory?

A: Current version loads all data into memory. For large datasets:

  • Use --subsample 0.5 to train on subset
  • Reduce --max-depth to limit tree size
  • Future version will add streaming support

Q: Do I need to specify --categorical-columns for every command?

A: No! Only for:

  • ✅ train (stores in model metadata)
  • ✅ cross-validate (trains new models each time)

Not needed for:

  • ❌ evaluate (reads from model metadata)
  • ❌ predict (reads from model metadata)
  • ❌ feature-importance (uses stored feature names)
  • ❌ info (displays stored metadata)

Q: Can I use this for time series forecasting?

A: GBRT is not ideal for time series because:

  • No temporal awareness
  • Assumes i.i.d. data
  • Use specialized libraries like prophet or statsforecast instead

Q: How do I interpret feature importance?

A: Importance = total reduction in loss contributed by splits on that feature. Important notes:

  • Only shows predictive power, not causation
  • Can be biased toward high-cardinality features
  • Use SHAP values for more nuanced interpretation (planned feature)
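
For example, feature_importance() returns one score per feature; assuming the score order matches the stored feature names, ranking the top-k features programmatically can look like this sketch:

// Sketch: pair feature names with importance scores and keep the top k.
// Assumes scores[i] corresponds to names[i].
fn top_features(names: &[String], scores: &[f64], k: usize) -> Vec<(String, f64)> {
    let mut pairs: Vec<(String, f64)> = names
        .iter()
        .cloned()
        .zip(scores.iter().copied())
        .collect();
    // Sort descending by score (importance scores are finite, so unwrap is safe)
    pairs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    pairs.truncate(k);
    pairs
}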

Contributing

We welcome contributions! Please see CONTRIBUTING.md.

Development Setup

# Fork and clone
git clone https://github.com/Guap-Codes/gbrt-rs.git
cd gbrt-rs

# Install Rust tools
rustup component add rustfmt clippy

# Setup pre-commit hooks
pip install pre-commit
pre-commit install

# Create feature branch
git checkout -b feature/my-feature

# Make changes ensuring:
cargo fmt --all              # Format code
cargo clippy --all -- -D warnings  # Fix warnings
cargo test --all             # Run tests

# Submit PR

Code Standards

  • Follow Rust naming conventions (snake_case, UpperCamelCase)
  • Add tests for new functionality
  • Update documentation
  • Keep CI green

Roadmap

v0.2.0 (Current)

  • ✅ Regression and binary classification
  • ✅ Categorical feature encoding
  • ✅ Early stopping
  • ✅ Cross-validation
  • ✅ Feature importance

v0.3.0 (In Progress)

  • Multi-threaded training
  • Multi-class classification
  • Missing value imputation
  • Hyperparameter search utilities

v0.4.0 (Future)

  • GPU acceleration via CUDA
  • SHAP value computation
  • ONNX model export
  • Python bindings
  • Streaming for large datasets

Benchmarks

Performance on a 100,000 sample, 50 feature dataset:

Library     Train Time   Predict Time   Memory    R²
gbrt-rs     2.3s         12ms           85 MB     0.847
XGBoost     2.8s         18ms           120 MB    0.852
LightGBM    1.9s         15ms           95 MB     0.849
sklearn     8.7s         45ms           280 MB    0.831

Measured on: AMD Ryzen 9 5900X, 32GB RAM, single-threaded

License

This project is licensed under the MIT License - see LICENSE for details.

Citation

If you use GBRT-RS in research, please cite:

@software{gbrt_rs_2025,
  title={GBRT-RS: High-Performance Gradient Boosting in Rust},
  author={Guap-Codes},
  year={2025},
  url={https://github.com/Guap-Codes/gbrt-rs},
  version={0.2.0}
}

Acknowledgments

  • Algorithm: Based on "Greedy Function Approximation: A Gradient Boosting Machine" (Friedman, 2001)
  • Inspiration: XGBoost, LightGBM, CatBoost
  • Rust Ecosystem: Built with ndarray, serde, clap, csv, thiserror

Contact & Support


Happy Boosting! 🚀

Made with ❤️ in Rust

