| Crates.io | oxidizr |
| lib.rs | oxidizr |
| version | 0.1.0-beta1 |
| created_at | 2025-12-04 21:09:35.658763+00 |
| updated_at | 2025-12-04 21:09:35.658763+00 |
| description | A Rust-based LLM training framework built on Candle |
| homepage | https://github.com/farhan-syah/oxidizr |
| repository | https://github.com/farhan-syah/oxidizr |
| max_upload_size | |
| id | 1967069 |
| size | 401,505 |
A Rust-based LLM training framework built on Candle. Oxidizr is a flexible trainer - bring your own config and dataset, and start training.
Full Documentation | Architecture Guide | CLI Reference
Recommended: Install from Git (supports CUDA 12.x and 13.x):
cargo install --git https://github.com/farhan-syah/oxidizr
From crates.io (CUDA 12.x only):
cargo install oxidizr
Note: The crates.io version only supports CUDA 12.x. For CUDA 13.x support, install from Git.
# Clone and build
git clone https://github.com/farhan-syah/oxidizr
cd oxidizr
cargo build --release
# Train with a sample config
cargo run --release -- -f models/nano.yaml
That's it! Training works on CPU out of the box - no GPU required.
For faster training with GPU:
cargo build --release --features cuda
cargo run --release --features cuda -- -f models/nano.yaml
GPU training is significantly faster but completely optional. CPU training is fully functional, just slower.
Oxidizr is a production-grade LLM trainer written in Rust. You provide a YAML config describing your model architecture and training settings, plus a tokenized dataset.
Oxidizr handles the training loop, optimization, checkpointing, and logging.
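Conceptually, the loop it manages looks something like this (an illustrative sketch, not the actual trainer.rs code; the function and its arguments are made up for the example):
// Illustrative shape of a training loop with periodic logging and
// checkpointing (not oxidizr's real trainer).
fn run(max_steps: usize, log_interval: usize, save_interval: usize) {
    for step in 1..=max_steps {
        // forward pass, loss, backward pass, optimizer update happen here
        if step % log_interval == 0 {
            println!("step {step}: log metrics");
        }
        if step % save_interval == 0 {
            println!("step {step}: write checkpoint");
        }
    }
}

fn main() {
    run(5000, 10, 500); // values matching the sample config below
}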
We include several example configs in models/ to help you get started:
- nano.yaml - Small Llama-style GPT (~83M params, Llama 3 vocab)
- nano_mamba2.yaml - Hybrid Mamba2 + MLA architecture
- nano_mamba2_pure.yaml - Pure Mamba2 architecture

These are educational examples showing you how to configure oxidizr. Feel free to create your own configs for your specific use case.
Create a YAML file with your model architecture and training settings:
# my_model.yaml
model:
  hidden_size: 512
  num_layers: 8
  num_heads: 8
  kv_heads: 4
  vocab_size: 128354 # Llama 3 + splintr agent tokens
  max_seq_len: 512
  rope_theta: 10000.0
  intermediate_size: 2048
trainer:
  learning_rate: 0.0003
  batch_size: 2
  max_steps: 5000
  num_epochs: 2
  gradient_accumulation: 1
  checkpoint_dir: "./checkpoints"
  log_interval: 10
  save_interval: 500
Run it:
cargo run --release --features cuda -- -f my_model.yaml
Oxidizr accepts tokenized data in binary format (u32 tokens):
Option 1: Use the educational dataset
The data/nano-start/ directory contains a curated educational dataset designed to help you understand LLM training fundamentals. See the data/ directory for details.
Option 2: Bring your own tokenized data
Create a binary file containing raw u32 tokens:
# Using splintr tokenizer (recommended)
from splintr import Tokenizer
tokenizer = Tokenizer("llama3")
tokens = tokenizer.encode("Your training text here...")
# Save as binary u32 array
import numpy as np
np.array(tokens, dtype=np.uint32).tofile("data/my_dataset.bin")
Then point your config to the data file, or load it programmatically in your training script.
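If you want to sanity-check such a file from Rust, one way to read it back is with the standard library (an illustrative sketch; the path and the little-endian assumption are examples, not part of oxidizr's API):
use std::fs;

fn main() -> std::io::Result<()> {
    // Read the raw bytes and reinterpret them as little-endian u32 tokens.
    let bytes = fs::read("data/my_dataset.bin")?;
    let tokens: Vec<u32> = bytes
        .chunks_exact(4)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    println!("loaded {} tokens", tokens.len());
    Ok(())
}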
Option 3: Generate dummy data for testing
For quick testing, oxidizr can generate random tokens:
use oxidizr::data::{LitDataLoader, create_dummy_data};
let tokens = create_dummy_data(128354, 100_000); // vocab_size, num_tokens
let data_loader = LitDataLoader::new(tokens, batch_size, seq_len, device);
Oxidizr supports multiple state-of-the-art architecture components - standard attention with GQA, Mamba2 (SSD), Multi-head Latent Attention (MLA), and Mixture of Experts (MoE).
You can mix and match components. For example, the nano_mamba2.yaml config pairs Mamba2 layers with MLA + MoE layers.
Configure hybrid models by specifying which layers use which architecture in your YAML.
cargo run --release --features cuda -- [OPTIONS]
Options:
-f, --config <FILE> Path to YAML configuration file (required)
-d, --data <FILE> Path to tokenized data file (.bin)
--target-device <gpu|cpu> Override target device (default: gpu if available)
--seq-len <N> Override sequence length from config
--batch-size <N> Override batch size from config
--grad-accum <N> Override gradient accumulation from config
--max-steps <N> Override max training steps from config
--gpus <IDS> Comma-separated GPU IDs for multi-GPU (e.g., 0,1,2,3)
--sync-backend <cpu|nccl> Gradient sync backend for multi-GPU (default: cpu)
--prefetch <N> Prefetch N batches in background (default: 0)
--resume <PATH|auto> Resume from checkpoint (.safetensors) or "auto" for latest
--headless Output JSON metrics only (for non-interactive terminals)
--dtype <f32|f16|bf16> Model precision (default: f32)
-h, --help Print help information
# Basic training with default settings
cargo run --release --features cuda -- -f models/nano.yaml
# Force CPU execution
cargo run --release -- -f models/nano.yaml --target-device cpu
# Override batch size and sequence length
cargo run --release --features cuda -- -f models/nano.yaml --batch-size 4 --seq-len 256
# Multi-GPU training (2 GPUs)
cargo run --release --features cuda -- -f models/nano.yaml --gpus 0,1 --sync-backend cpu
# Custom config file
cargo run --release --features cuda -- -f experiments/my_config.yaml
Interactive mode (default): shows a live progress bar with training metrics in the terminal.
Headless mode (--headless): prints JSON metrics only, suited to non-interactive terminals and log capture.
# If progress bar doesn't appear, use headless mode
cargo run --release -- -f models/nano.yaml --headless
# CPU training (no CUDA required)
cargo build --release
cargo run --release -- -f models/nano.yaml --target-device cpu
# GPU training (faster, requires CUDA)
cargo build --release --features cuda
cargo run --release --features cuda -- -f models/nano.yaml --target-device gpu
CPU training is fully functional - just slower. It's a good fit for learning the fundamentals, verifying configs, and small runs like the nano examples, with no CUDA toolkit required.
Oxidizr supports data-parallel training across multiple GPUs:
# Train on GPUs 0, 1, 2, 3 with CPU backend
cargo run --release --features cuda -- -f models/nano.yaml --gpus 0,1,2,3 --sync-backend cpu
# Train with NCCL backend (faster for 4+ GPUs, requires nccl feature)
cargo run --release --features cuda,nccl -- -f models/nano.yaml --gpus 0,1 --sync-backend nccl
How it works: each GPU holds a full replica of the model and processes its own slice of every batch; gradients are synchronized across GPUs via the selected --sync-backend before each optimizer step.
Effective batch size = batch_size × gradient_accumulation × num_gpus
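As a rough mental model, gradient synchronization amounts to averaging each parameter's gradient across replicas before the optimizer step (an illustrative sketch, not oxidizr's actual sync code):
// Average per-parameter gradients across GPU replicas (illustrative only).
fn average_gradients(replica_grads: &[Vec<f32>]) -> Vec<f32> {
    let n = replica_grads.len() as f32;
    let mut avg = vec![0.0f32; replica_grads[0].len()];
    for grads in replica_grads {
        for (a, g) in avg.iter_mut().zip(grads) {
            *a += *g / n;
        }
    }
    avg
}

fn main() {
    let grads = vec![vec![1.0, 2.0], vec![3.0, 4.0]]; // two replicas
    assert_eq!(average_gradients(&grads), vec![2.0, 3.0]);
}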
model:
  # Architecture parameters
  hidden_size: 512
  num_layers: 8
  num_heads: 8
  kv_heads: 4 # For GQA (fewer KV heads than Q heads)
  vocab_size: 128354 # Llama 3 + splintr agent tokens
  max_seq_len: 512
  rope_theta: 10000.0
  intermediate_size: 2048
trainer:
  # Training hyperparameters
  learning_rate: 0.0003
  batch_size: 2
  max_steps: 5000
  num_epochs: 2
  gradient_accumulation: 1
  checkpoint_dir: "./checkpoints"
  log_interval: 10
  save_interval: 500
  load_balance_alpha: 0.0 # MoE load balancing (0.0 = disabled)
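If you load configs like this from your own tooling, a serde-based deserializer is one option (a sketch assuming the serde and serde_yaml crates; the struct names mirror the YAML keys but are not oxidizr's actual types):
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ModelConfig {
    hidden_size: usize,
    num_layers: usize,
    num_heads: usize,
    kv_heads: usize,
    vocab_size: usize,
    max_seq_len: usize,
    rope_theta: f64,
    intermediate_size: usize,
}

#[derive(Debug, Deserialize)]
struct TrainerConfig {
    learning_rate: f64,
    batch_size: usize,
    max_steps: usize,
    num_epochs: usize,
    gradient_accumulation: usize,
    checkpoint_dir: String,
    log_interval: usize,
    save_interval: usize,
}

// Optional keys not listed here (Mamba2, MLA, MoE fields) are simply
// ignored by serde's default behavior.
#[derive(Debug, Deserialize)]
struct Config {
    model: ModelConfig,
    trainer: TrainerConfig,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let yaml = std::fs::read_to_string("models/nano.yaml")?;
    let config: Config = serde_yaml::from_str(&yaml)?;
    println!("{config:#?}");
    Ok(())
}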
Add these fields to use Mamba2 instead of standard attention:
model:
  # ... other fields ...
  mamba2_num_heads: 48
  mamba2_head_dim: 16
  mamba2_state_size: 64
  mamba2_chunk_size: 64
  mamba2_n_groups: 1
  mamba2_conv_kernel: 4
  mamba2_expand: 2
  # CONSTRAINT: hidden_size * mamba2_expand == mamba2_num_heads * mamba2_head_dim
  # Example: 384 * 2 = 768 == 48 * 16 ✓
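A quick way to check that constraint before launching a run (an illustrative helper, not part of oxidizr's API):
// Verify the Mamba2 sizing constraint from the comment above.
fn mamba2_dims_consistent(hidden_size: usize, expand: usize, num_heads: usize, head_dim: usize) -> bool {
    hidden_size * expand == num_heads * head_dim
}

fn main() {
    assert!(mamba2_dims_consistent(384, 2, 48, 16));  // 768 == 768
    assert!(!mamba2_dims_consistent(512, 2, 48, 16)); // 1024 != 768
}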
For compressed KV cache and memory efficiency:
model:
  # ... other fields ...
  kv_latent_dim: 192 # Compressed KV dimension (instead of hidden_size)
  q_latent_dim: 192 # Compressed query dimension
  d_rope: 16 # RoPE dimension
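For a back-of-the-envelope sense of the memory saving, assume the cache stores the compressed latent plus the RoPE part instead of full-width keys and values (illustrative arithmetic only; the exact caching scheme is an assumption, not taken from this README):
fn main() {
    // Per-token, per-layer cached values, using the sample dimensions above.
    let hidden_size = 512usize;
    let kv_latent_dim = 192usize;
    let d_rope = 16usize;
    let full_kv = 2 * hidden_size;           // full K and V: 1024 values
    let compressed = kv_latent_dim + d_rope; // latent + RoPE part: 208 values
    println!("approx. cache reduction: {:.1}x", full_kv as f64 / compressed as f64);
}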
model:
  # ... other fields ...
  num_experts: 4 # Total number of experts
  experts_per_tok: 2 # Top-K routing (use 2 to prevent expert collapse)
  shared_expert_enabled: true
  intermediate_size: 1536
trainer:
  load_balance_alpha: 0.01 # MoE load balancing loss weight (required > 0 for MoE)
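The experts_per_tok setting is standard top-K routing: for each token, the K experts with the highest router scores are selected. A minimal sketch of the idea (illustrative only, not oxidizr's implementation):
// Pick the k experts with the highest router scores for one token.
fn top_k_experts(router_scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..router_scores.len()).collect();
    idx.sort_by(|&a, &b| router_scores[b].total_cmp(&router_scores[a]));
    idx.truncate(k);
    idx
}

fn main() {
    let scores = [0.1, 0.7, 0.05, 0.4]; // 4 experts
    assert_eq!(top_k_experts(&scores, 2), vec![1, 3]); // top-2 routing
}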
Specify which layers use Mamba vs Attention:
model:
  # ... other fields ...
  mamba_layers: [0, 1, 2, 4, 5, 6] # These layers use Mamba
  # Other layers use MLA + MoE
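Conceptually, the mamba_layers list turns into a per-layer block plan like this (illustrative types only, not oxidizr's internals):
#[derive(Debug, PartialEq)]
enum Block {
    Mamba2,
    MlaMoe, // attention (MLA) + MoE feed-forward
}

// Decide each layer's block type from the `mamba_layers` list.
fn layer_plan(num_layers: usize, mamba_layers: &[usize]) -> Vec<Block> {
    (0..num_layers)
        .map(|i| if mamba_layers.contains(&i) { Block::Mamba2 } else { Block::MlaMoe })
        .collect()
}

fn main() {
    let plan = layer_plan(8, &[0, 1, 2, 4, 5, 6]);
    assert_eq!(plan[0], Block::Mamba2);
    assert_eq!(plan[3], Block::MlaMoe); // layers 3 and 7 use attention
}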
oxidizr/
├── src/
│ ├── main.rs # CLI entry point
│ ├── config.rs # Configuration loading
│ ├── model.rs # Transformer model
│ ├── mamba.rs # Mamba1 implementation
│ ├── mamba2.rs # Mamba2 with SSD
│ ├── data.rs # Data loader
│ └── trainer.rs # Training loop
├── models/
│ ├── nano.yaml # Llama-style GPT example
│ ├── nano_mamba2.yaml # Hybrid Mamba2 + MLA example
│ └── nano_mamba2_pure.yaml # Pure Mamba2 example
├── data/
│ └── nano-start/ # Educational dataset for learning
└── Cargo.toml
The included nano configs are part of an educational initiative to help users learn LLM training fundamentals.
The data/nano-start/ directory contains a curated dataset designed for learning. It's small enough to train quickly while demonstrating key concepts.
This is guidance, not a requirement. Oxidizr is a general-purpose trainer. The nano examples exist to help you get started - you're free to create any architecture and use any dataset you want.
effective_batch = batch_size × gradient_accumulation × num_gpus
Example: batch_size=2, gradient_accumulation=4, num_gpus=2 → effective batch of 16
Enable async data loading to overlap CPU I/O with GPU compute:
cargo run --release --features cuda -- -f models/nano.yaml --prefetch 2
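The idea is the classic producer/consumer pattern: a background thread stays up to N batches ahead of the training step. A minimal standard-library sketch of the pattern (not oxidizr's loader code):
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Bounded channel: the producer can run at most `prefetch` batches ahead.
    let prefetch = 2;
    let (tx, rx) = sync_channel::<Vec<u32>>(prefetch);
    let producer = thread::spawn(move || {
        for step in 0..10u32 {
            // Stand-in for reading/tokenizing the next batch from disk.
            let batch = vec![step; 8];
            if tx.send(batch).is_err() {
                break;
            }
        }
    });
    // Consumer: stand-in for the GPU forward/backward pass on each batch.
    for batch in rx {
        let _checksum: u32 = batch.iter().sum();
    }
    producer.join().unwrap();
}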
# Run tests
cargo test
# Build documentation
cargo doc --open
# Lint
cargo clippy
# Format
cargo fmt
MIT License - See LICENSE file for details
Status: Early Development | Version: 0.1.0 | Last Updated: 2025-12-05