rafor

Crates.io	rafor
lib.rs	rafor
version	0.3.0
created_at	2025-08-16 21:15:32.274817+00
updated_at	2025-12-25 16:55:05.392734+00
description	Fast Random Forest library.
repository	https://github.com/savthe/rafor/
id	1798911
size	1,968,703

(savthe)

README

Rafor is a performance-oriented Random Forest and Decision Trees library.

Classification

Rafor provides a decision tree (DT) classifier, dt::Classifier, and a random forest (RF) classifier, rf::Classifier. Class labels are i64 values. Both classifiers use the Gini index to evaluate split impurity.
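As an illustration of the criterion named above, here is a minimal sketch of the textbook Gini impurity and the size-weighted impurity of a candidate split. The function names gini and split_gini are made up for this example. The README does not say how rafor computes the index internally (for instance, how sample weights enter), so this is not its implementation.

```rust
/// Gini impurity of a node, given the number of samples per class.
/// Returns 0.0 for a pure or empty node.
/// Hypothetical helper: not part of the rafor API.
fn gini(class_counts: &[usize]) -> f64 {
    let total: usize = class_counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let sum_of_squares: f64 = class_counts
        .iter()
        .map(|&count| {
            let p = count as f64 / total as f64;
            p * p
        })
        .sum();
    1.0 - sum_of_squares
}

/// Impurity of a candidate split: the size-weighted average of the
/// Gini impurities of the two children (lower is better).
/// Hypothetical helper: not part of the rafor API.
fn split_gini(left: &[usize], right: &[usize]) -> f64 {
    let n_left: usize = left.iter().sum();
    let n_right: usize = right.iter().sum();
    let n = (n_left + n_right) as f64;
    if n == 0.0 {
        return 0.0;
    }
    (n_left as f64 / n) * gini(left) + (n_right as f64 / n) * gini(right)
}

fn main() {
    // A balanced two-class node is maximally impure for two classes.
    println!("balanced node: {}", gini(&[5, 5]));
    // A node holding a single class is pure.
    println!("pure node: {}", gini(&[10, 0]));
    // A split that separates the two classes perfectly.
    println!("perfect split: {}", split_gini(&[5, 0], &[0, 5]));
}
```

A split search would evaluate split_gini for each candidate threshold and keep the one with the lowest value.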

Classifiers provide a predict method for a batch of samples; it returns a Vec<i64> of predicted class labels. The predict_one method returns an i64, the predicted class for a single sample.

To get probability distributions, use the proba method. It returns a Vec<f32> of length num_samples * num_classes, where the i-th chunk of length num_classes contains the class probabilities for the i-th sample. Within each chunk, classes are ordered by their label values.
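The flat layout can be decoded without any library support. The sketch below uses made-up probabilities rather than real model output, and the helpers sorted_labels and argmax_labels are hypothetical names introduced for this example. It assumes "ordered by their values" means ascending order of the distinct i64 labels.

```rust
/// Distinct class labels in ascending order (assumed to be the ordering
/// of the probabilities inside each chunk).
/// Hypothetical helper: not part of the rafor API.
fn sorted_labels(targets: &[i64]) -> Vec<i64> {
    let mut labels = targets.to_vec();
    labels.sort_unstable();
    labels.dedup();
    labels
}

/// For each chunk of `labels.len()` probabilities, returns the label
/// with the highest probability (the first one wins on ties).
/// Hypothetical helper: not part of the rafor API.
fn argmax_labels(proba: &[f32], labels: &[i64]) -> Vec<i64> {
    if labels.is_empty() {
        return Vec::new();
    }
    proba
        .chunks(labels.len())
        .map(|chunk| {
            let mut best = 0;
            for (i, &p) in chunk.iter().enumerate() {
                if p > chunk[best] {
                    best = i;
                }
            }
            labels[best]
        })
        .collect()
}

fn main() {
    let targets = [1i64, 5, 1, -15, 5];
    let labels = sorted_labels(&targets);
    println!("class order: {:?}", labels);

    // Made-up probabilities for two samples and three classes.
    let proba = [0.1f32, 0.7, 0.2, 0.0, 0.25, 0.75];
    for (i, chunk) in proba.chunks(labels.len()).enumerate() {
        println!("sample {}: {:?}", i, chunk);
    }
    println!("most likely labels: {:?}", argmax_labels(&proba, &labels));
}
```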

Regression

The regression models are the decision tree regressor dt::Regressor and the random forest regressor rf::Regressor. Targets are f32 values. By default, regressors use the MSE score to evaluate split impurity.
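For reference, a minimal sketch of the usual MSE criterion: the mean squared deviation of a node's targets from their mean, and the size-weighted value for a candidate split. The names mse and split_mse are invented for this example, and the README does not confirm that rafor computes the score this way.

```rust
/// Mean squared error of a node's targets around their mean
/// (equivalently, the variance of the targets). Returns 0.0 when empty.
/// Hypothetical helper: not part of the rafor API.
fn mse(targets: &[f32]) -> f32 {
    if targets.is_empty() {
        return 0.0;
    }
    let n = targets.len() as f32;
    let mean = targets.iter().sum::<f32>() / n;
    targets
        .iter()
        .map(|&t| (t - mean) * (t - mean))
        .sum::<f32>()
        / n
}

/// Impurity of a candidate split: the size-weighted average of the
/// MSE of the two children (lower is better).
/// Hypothetical helper: not part of the rafor API.
fn split_mse(left: &[f32], right: &[f32]) -> f32 {
    let n = (left.len() + right.len()) as f32;
    if n == 0.0 {
        return 0.0;
    }
    (left.len() as f32 / n) * mse(left) + (right.len() as f32 / n) * mse(right)
}

fn main() {
    // Targets spread around their mean give a positive score.
    println!("unsplit node: {}", mse(&[1.0, 1.0, 3.0, 3.0]));
    // Splitting into two constant groups removes all error.
    println!("after split: {}", split_mse(&[1.0, 1.0], &[3.0, 3.0]));
}
```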

Dataset

Multiple samples for inference or training are provided as a single f32 slice, in which each chunk of num_features values (the size of the feature space) is treated as the feature vector of one sample. During training, num_features is derived as the length of the f32 input slice divided by the number of provided targets.
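The derivation of num_features can be sketched in plain Rust, without calling rafor. The helper name num_features is made up for this example, and returning None for a dataset that does not divide evenly is an assumption; the README does not say how rafor treats that case.

```rust
/// Derives the feature count from the flat dataset length and the number
/// of targets. Returns None when the dataset is empty, there are no
/// targets, or the length does not divide evenly (assumed handling).
/// Hypothetical helper: not part of the rafor API.
fn num_features(dataset_len: usize, num_targets: usize) -> Option<usize> {
    if dataset_len == 0 || num_targets == 0 || dataset_len % num_targets != 0 {
        return None;
    }
    Some(dataset_len / num_targets)
}

fn main() {
    // Ten values and five targets: two features per sample.
    let dataset = [0.7f32, 0.0, 0.8, 1.0, 0.3, 0.0, 1.0, 1.3, 0.4, 2.1];
    let targets = [1i64, 5, 1, -15, 5];

    let nf = num_features(dataset.len(), targets.len())
        .expect("dataset length must be a multiple of the number of targets");
    println!("num_features = {}", nf);

    // Each chunk of nf values is the feature vector of one sample.
    for (sample, target) in dataset.chunks(nf).zip(targets.iter()) {
        println!("{:?} -> {}", sample, target);
    }
}
```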

Model training

All models provide a trainer() method, which returns a Trainer object for that model. The Trainer offers a builder interface (bring it into scope with use rafor::prelude::*) for setting optional training parameters, and a train method that takes the dataset and targets.

The currently supported training parameters are listed below. For default values, see the documentation of the concrete models.

Common parameters

The following parameters are common for decision trees and forests.

max_depth: usize, the maximum tree depth.

max_features: [MaxFeaturesPolicy], the maximum number of features considered when searching for the best split value at a decision tree node. If no split value is found, additional features are considered until a split is found or all features have been tried.

seed: u64, the seed for the random number generator. In trees, random numbers are used to generate the feature sequence for the split search when max_features is less than the number of features in the training dataset. In RF, the per-tree datasets are generated by random sampling, and the seeds for individual trees are also randomly generated, because by default max_features in RF is less than the total number of features.

min_samples_leaf: usize, guarantees that each leaf contains at least min_samples_leaf samples. Default: 1.

min_samples_split: usize, the minimum number of samples a node must contain to be considered for splitting.

sample_weights: Vec<f32>, the weight of each sample. If empty, every sample has weight 1.0.

Ensemble parameters

num_trees: usize, the number of individual trees in the ensemble.

num_threads: usize, the number of CPU threads used for training.

Example

use rafor::prelude::*; // Required for .with_option builders and .num_classes().
use rafor::rf::Classifier;
// The num_cpus crate must be listed as a dependency in Cargo.toml.

fn main() {
    // Dataset for 5 samples (number of samples is defined by the number of targets).
    let dataset = [
        0.7, 0.0,
        0.8, 1.0,
        0.3, 0.0,
        1.0, 1.3,
        0.4, 2.1
    ];

    // Target classes.
    let targets = [1, 5, 1, -15, 5];

    let predictor = Classifier::trainer()
        .with_max_depth(15)
        .with_trees(40)
        .with_threads(num_cpus::get())
        .with_seed(42)
        .train(&dataset, &targets);

    // Get predictions for the same dataset, using all CPU cores.
    let predictions = predictor.predict(&dataset, num_cpus::get());
    println!("Predictions: {:?}", predictions);

    // Now get the probability distribution over classes for each sample. Use all CPU cores.
    let proba = predictor.proba(&dataset, num_cpus::get());
    println!("Probability distributions:");
    for p in proba.chunks(predictor.num_classes()) {
        println!("{:?}", p);
    }
}

Model serialization and deserialization

All models support serde, so any serde-compatible library can be used for serialization and deserialization.

Space / performance considerations

Rafor uses a compact tree representation, subject to the following restrictions:

split threshold is f32;
feature index is u16, so up to 2^16 = 65,536 features are allowed;
in regression tasks, the target type is f32;
in classification tasks, the class is represented by u32 (the input i64 labels are mapped into u32 internally, and restored during prediction);
child node index is u32, so up to 2^32 = 4,294,967,296 nodes are allowed.

The decision tree is represented by a vector of internal (parent) nodes. A leaf value (an f32 for regression trees, a u32 index pointing to the class probabilities for classification trees) is bit-packed into the parent's u32 child node index.
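One reason such packing is possible for regression trees is that an f32 occupies exactly the same 32 bits as a u32. The sketch below shows only that lossless round trip, using the standard f32::to_bits and f32::from_bits. The names pack_leaf and unpack_leaf are hypothetical, and how rafor tells a packed leaf value apart from a real child index is not described in this README.

```rust
/// Stores an f32 leaf value in a u32 slot by reinterpreting its bits.
/// Hypothetical helper: not part of the rafor API.
fn pack_leaf(value: f32) -> u32 {
    value.to_bits()
}

/// Recovers the f32 leaf value from a u32 slot.
/// Hypothetical helper: not part of the rafor API.
fn unpack_leaf(slot: u32) -> f32 {
    f32::from_bits(slot)
}

fn main() {
    // Both types have the same size, so no extra storage is needed.
    assert_eq!(std::mem::size_of::<f32>(), std::mem::size_of::<u32>());

    let slot = pack_leaf(-1.5);
    println!("packed: {:#010x}, unpacked: {}", slot, unpack_leaf(slot));
}
```

The benefit of this kind of layout is that leaves need no separate node entries, which keeps the node vector small.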

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in rafor by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Commit count: 27