scirs2-datasets

Version: 0.1.0-beta.2
Created: 2025-04-12 19:47:14 UTC
Updated: 2025-09-20 08:57:10 UTC
Description: Datasets module for SciRS2 (scirs2-datasets)
Repository: https://github.com/cool-japan/scirs
Owner: KitaSan (cool-japan)

SciRS2 Datasets

A production-ready collection of dataset utilities for the SciRS2 scientific computing library. This module provides comprehensive functionality for loading, generating, and working with datasets commonly used in scientific computing, machine learning, and statistical analysis.

🚀 Production Status - Beta (0.1.0-beta.2)

This beta release implements all core functionality, is covered by 117+ tests, and builds with zero warnings. The API is considered stable, follows Rust best practices, and is intended for production use.

✨ Features

  • 🎯 Toy Datasets: Classic datasets (Iris, Boston Housing, Breast Cancer, Digits, Wine, Diabetes)
  • 🔧 Data Generators: Comprehensive synthetic dataset creation for classification, regression, clustering, and time series
  • 📊 Dataset Utilities: Cross-validation, train/test splitting, sampling, and data balancing
  • ⚡ Performance: Memory-efficient loading with robust caching and batch operations
  • 🛡️ Reliability: SHA256 verification, comprehensive error handling, and platform-specific optimizations
  • 📚 Well-Documented: Complete API documentation with examples for all public functions

Installation

Add to your Cargo.toml:

[dependencies]
scirs2-datasets = "0.1.0-beta.2"

For remote dataset downloading capabilities:

[dependencies]
scirs2-datasets = { version = "0.1.0-beta.2", features = ["download"] }

Quick Start

Load Classic Datasets

// The `?` operator in these snippets assumes an enclosing function that returns a `Result`.
use scirs2_datasets::{load_iris, load_boston};

// Load the Iris dataset
let iris = load_iris()?;
println!("Iris: {} samples, {} features", iris.n_samples(), iris.n_features());

// Load Boston housing dataset  
let boston = load_boston()?;
println!("Boston: {} samples, {} features", boston.n_samples(), boston.n_features());

Generate Synthetic Data

use scirs2_datasets::{make_classification, make_blobs, make_spirals};

// Classification dataset
let dataset = make_classification(1000, 10, 3, 2, 4, Some(42))?;
println!("Classification: {} samples, {} features", dataset.n_samples(), dataset.n_features());

// Non-linear patterns
let spirals = make_spirals(500, 2, 0.1, Some(42))?;
let blobs = make_blobs(300, 2, 4, 1.0, Some(42))?;
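
To make the idea behind a blob generator concrete, the following is a minimal std-only sketch and not the crate's implementation. The names `Lcg` and `sketch_blobs`, the round-robin label assignment, and the choice of drawing centers from a [-10, 10] box are all assumptions made for illustration; the real `make_blobs` may differ in every one of these respects.

```rust
/// Tiny linear congruential generator, used only so the sketch needs no
/// external crates. It is not suitable for serious statistical work.
struct Lcg(u64);

impl Lcg {
    /// Returns a uniform value in the half-open interval (0, 1].
    fn next_f64(&mut self) -> f64 {
        // Multiplier and increment from Knuth's MMIX generator.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Keep the top 53 bits, then shift into (0, 1] so ln() stays finite.
        ((self.0 >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_normal(&mut self) -> f64 {
        let u1 = self.next_f64();
        let u2 = self.next_f64();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// Hypothetical blob generator: `n_samples` points in `n_features`
/// dimensions, assigned round-robin to `n_centers` Gaussian clusters.
/// Returns (feature rows, cluster label per row).
fn sketch_blobs(
    n_samples: usize,
    n_features: usize,
    n_centers: usize,
    cluster_std: f64,
    seed: u64,
) -> (Vec<Vec<f64>>, Vec<usize>) {
    assert!(n_centers > 0, "need at least one center");
    let mut rng = Lcg(seed);
    // Draw each center coordinate uniformly from roughly [-10, 10].
    let centers: Vec<Vec<f64>> = (0..n_centers)
        .map(|_| (0..n_features).map(|_| rng.next_f64() * 20.0 - 10.0).collect())
        .collect();
    let mut data = Vec::with_capacity(n_samples);
    let mut labels = Vec::with_capacity(n_samples);
    for i in 0..n_samples {
        let label = i % n_centers;
        // Each point is its cluster center plus isotropic Gaussian noise.
        let row: Vec<f64> = centers[label]
            .iter()
            .map(|c| c + cluster_std * rng.next_normal())
            .collect();
        data.push(row);
        labels.push(label);
    }
    (data, labels)
}
```

Calling `sketch_blobs(300, 2, 4, 1.0, 42)` twice yields identical output, which is the property a seed argument such as `Some(42)` is meant to provide.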

Cross-Validation and Splitting

use scirs2_datasets::{load_iris, k_fold_split, stratified_k_fold_split, train_test_split};

let iris = load_iris()?;

// K-fold cross-validation
let folds = k_fold_split(iris.n_samples(), 5, true, Some(42))?;

// Stratified splitting needs the target labels
if let Some(target) = &iris.target {
    let stratified_folds = stratified_k_fold_split(target, 5, true, Some(42))?;
}

// Train/test split over sample indices (does not depend on the targets)
let (train_idx, test_idx) = train_test_split(iris.n_samples(), 0.8, Some(42))?;
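
For readers who want to see what a k-fold split produces, here is a minimal std-only sketch of the general technique. It is not the crate's `k_fold_split`: the function name `sketch_k_fold`, its signature (a single optional seed instead of a shuffle flag plus seed), and the small LCG used for shuffling are assumptions made for illustration.

```rust
/// Hypothetical k-fold splitter: returns one (train, validation) pair of
/// index vectors per fold. Passing `Some(seed)` shuffles the indices first.
fn sketch_k_fold(
    n_samples: usize,
    n_folds: usize,
    shuffle_seed: Option<u64>,
) -> Vec<(Vec<usize>, Vec<usize>)> {
    assert!(
        n_folds >= 2 && n_folds <= n_samples,
        "need 2 <= n_folds <= n_samples"
    );
    let mut indices: Vec<usize> = (0..n_samples).collect();
    if let Some(seed) = shuffle_seed {
        // Fisher-Yates shuffle driven by a small LCG. The modulo adds a
        // slight bias, which is acceptable for a sketch.
        let mut state = seed;
        for i in (1..n_samples).rev() {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let j = ((state >> 33) as usize) % (i + 1);
            indices.swap(i, j);
        }
    }
    // The first `n_samples % n_folds` folds receive one extra sample.
    let base = n_samples / n_folds;
    let extra = n_samples % n_folds;
    let mut folds = Vec::with_capacity(n_folds);
    let mut start = 0;
    for fold in 0..n_folds {
        let len = base + usize::from(fold < extra);
        // The current slice is held out; everything else is training data.
        let validation: Vec<usize> = indices[start..start + len].to_vec();
        let train: Vec<usize> = indices[..start]
            .iter()
            .chain(&indices[start + len..])
            .copied()
            .collect();
        folds.push((train, validation));
        start += len;
    }
    folds
}
```

Every sample index appears in exactly one validation set, and fold sizes differ by at most one; with 10 samples and 3 folds the validation sets have 4, 3, and 3 elements.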

Core Components

🎯 Toy Datasets

Pre-loaded classic datasets for immediate use:

use scirs2_datasets::{load_iris, load_digits, load_wine, load_breast_cancer, load_diabetes, load_boston};

// Every loader returns a Dataset<f64> with a consistent API
let iris = load_iris()?;          // 150 samples, 4 features, 3 classes
let digits = load_digits()?;      // 1797 samples, 64 features, 10 classes  
let wine = load_wine()?;          // 178 samples, 13 features, 3 classes
let cancer = load_breast_cancer()?; // 569 samples, 30 features, 2 classes
let diabetes = load_diabetes()?;  // 442 samples, 10 features, regression
let boston = load_boston()?;      // 506 samples, 13 features, regression

🔧 Data Generators

Comprehensive synthetic dataset creation:

use scirs2_datasets::{
    make_classification, make_regression, make_blobs, make_circles,
    make_moons, make_spirals, make_swiss_roll, make_time_series
};

// Linear and non-linear patterns
let classification = make_classification(500, 8, 2, 1, 2, Some(42))?;
let regression = make_regression(400, 5, 3, 0.1, Some(42))?;
let circles = make_circles(300, 0.1, Some(42))?;
let moons = make_moons(200, 0.05, Some(42))?;

// Complex patterns
let spirals = make_spirals(600, 3, 0.2, Some(42))?;
let swiss_roll = make_swiss_roll(800, 0.1, Some(42))?;

// Time series
let ts = make_time_series(1000, 24, 0.1, Some(42))?;
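
As an illustration of what a "non-linear pattern" generator produces, the sketch below places points on two concentric circles, a shape that no straight line can separate. It is deliberately noise-free and uses only the standard library; the name `sketch_circles` and the `inner_factor` parameter are hypothetical and are not taken from the crate, whose `make_circles` also accepts noise and seed arguments.

```rust
use std::f64::consts::PI;

/// Hypothetical noise-free generator for two concentric circles.
/// Outer-circle points (radius 1) get label 0 and come first;
/// inner-circle points (radius `inner_factor`) get label 1.
fn sketch_circles(n_samples: usize, inner_factor: f64) -> (Vec<[f64; 2]>, Vec<usize>) {
    assert!(
        inner_factor > 0.0 && inner_factor < 1.0,
        "inner_factor must lie in (0, 1)"
    );
    // The outer circle takes the extra point when n_samples is odd.
    let n_outer = (n_samples + 1) / 2;
    let n_inner = n_samples - n_outer;
    let mut points = Vec::with_capacity(n_samples);
    let mut labels = Vec::with_capacity(n_samples);
    for (count, radius, label) in [(n_outer, 1.0, 0), (n_inner, inner_factor, 1)] {
        for i in 0..count {
            // Spread the points evenly around the circle.
            let angle = 2.0 * PI * i as f64 / count as f64;
            points.push([radius * angle.cos(), radius * angle.sin()]);
            labels.push(label);
        }
    }
    (points, labels)
}
```

A noisy variant would add a small Gaussian perturbation to each coordinate, which is presumably the role of the noise argument in the calls above.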

📊 Dataset Utilities

Complete toolkit for dataset manipulation:

use scirs2_datasets::{
    // Cross-validation
    k_fold_split, stratified_k_fold_split, time_series_split,
    // Sampling  
    random_sample, stratified_sample, bootstrap_sample, importance_sample,
    // Balancing
    create_balanced_dataset, random_oversample, random_undersample,
    // Feature engineering
    polynomial_features, create_binned_features, statistical_features,
    // Scaling
    min_max_scale, robust_scale, normalize
};
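
To show what one of these utilities does conceptually, here is a minimal std-only sketch of column-wise min-max scaling. The function name `sketch_min_max_scale`, the `Vec<Vec<f64>>` row layout, and the handling of constant columns are assumptions for illustration; the crate's `min_max_scale` may take different argument types and a target range.

```rust
/// Hypothetical column-wise min-max scaler: maps each feature to [0, 1].
/// Constant columns are mapped to 0.0 to avoid dividing by zero.
fn sketch_min_max_scale(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_features = rows.first().map_or(0, |r| r.len());
    let mut mins = vec![f64::INFINITY; n_features];
    let mut maxs = vec![f64::NEG_INFINITY; n_features];
    // First pass: find the minimum and maximum of every column.
    for row in rows {
        assert_eq!(row.len(), n_features, "ragged rows are not supported");
        for (j, &v) in row.iter().enumerate() {
            mins[j] = mins[j].min(v);
            maxs[j] = maxs[j].max(v);
        }
    }
    // Second pass: rescale each value relative to its column's range.
    rows.iter()
        .map(|row| {
            row.iter()
                .enumerate()
                .map(|(j, &v)| {
                    let range = maxs[j] - mins[j];
                    if range > 0.0 {
                        (v - mins[j]) / range
                    } else {
                        0.0
                    }
                })
                .collect()
        })
        .collect()
}
```

For the rows `[1, 10]`, `[2, 10]`, `[3, 10]` the first column becomes `0.0, 0.5, 1.0`, while the constant second column becomes all zeros.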

⚡ Caching System

Efficient dataset management with automatic caching:

use scirs2_datasets::{CacheManager, DatasetCache};

let cache = CacheManager::new()?;
let stats = cache.get_statistics()?;
println!("Cache contains {} datasets using {} MB", 
         stats.total_files, stats.total_size_mb);
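
The statistics printed above (a file count and a total size) can be derived from a cache directory in a few lines. The sketch below shows one way to do that with the standard library only. `SketchCacheStats` and `sketch_cache_stats` are hypothetical names, and the flat directory layout is an assumption; nothing here reflects how `CacheManager` is actually implemented.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical summary of a flat cache directory.
#[derive(Debug, PartialEq)]
struct SketchCacheStats {
    total_files: usize,
    total_bytes: u64,
}

/// Counts regular files directly inside `dir` and sums their sizes.
/// Subdirectories are skipped, which assumes a flat cache layout.
fn sketch_cache_stats(dir: &Path) -> io::Result<SketchCacheStats> {
    let mut stats = SketchCacheStats {
        total_files: 0,
        total_bytes: 0,
    };
    for entry in fs::read_dir(dir)? {
        let meta = entry?.metadata()?;
        if meta.is_file() {
            stats.total_files += 1;
            stats.total_bytes += meta.len();
        }
    }
    Ok(stats)
}
```

Dividing `total_bytes` by `1024.0 * 1024.0` gives a megabyte figure comparable to the `total_size_mb` field shown above.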

Performance & Reliability

  • Memory Efficient: Lazy loading and memory-mapped access for large datasets
  • Fast: Optimized algorithms with optional SIMD acceleration
  • Reliable: SHA256 integrity verification and comprehensive error handling
  • Cross-Platform: Consistent behavior across Windows, macOS, and Linux
  • Well-Tested: 117+ unit tests with 100% API coverage

API Stability

The API is stable and production-ready. All public functions are thoroughly documented with examples. Breaking changes will only occur in major version updates (1.0.0+).

Integration

Seamlessly integrates with other SciRS2 modules:

use scirs2_datasets::{load_iris, make_classification};
// Use with scirs2-stats, scirs2-linalg, etc.

Contributing

See the project CONTRIBUTING.md for guidelines. Focus areas for contributions:

  • Performance optimization and benchmarking
  • Additional real-world datasets
  • Advanced data generation algorithms
  • Integration examples and tutorials

License

Dual-licensed under MIT or Apache License 2.0.
