| Crates.io | trustformers-optim |
| lib.rs | trustformers-optim |
| version | 0.1.0-alpha.1 |
| created_at | 2025-11-09 10:14:34.456595+00 |
| updated_at | 2025-11-09 10:14:34.456595+00 |
| description | Optimizers for TrustformeRS |
| homepage | |
| repository | https://github.com/cool-japan/trustformers |
| max_upload_size | |
| id | 1923938 |
| size | 2,541,776 |
Optimization algorithms and learning rate schedulers for training transformer models.
This crate provides the optimization infrastructure for training: implementations of SGD, Adam, AdamW, LAMB, and AdaFactor, a collection of learning rate schedulers, and distributed optimization support covering all three ZeRO stages.

A typical training loop pairs an optimizer with a learning rate scheduler:
```rust
use trustformers_optim::{
    optimizers::{AdamW, AdamWConfig},
    schedulers::{LinearScheduler, SchedulerConfig},
    Optimizer,
};

// Create the AdamW optimizer
let config = AdamWConfig {
    lr: 5e-5,
    betas: (0.9, 0.999),
    eps: 1e-8,
    weight_decay: 0.01,
    correct_bias: true,
};
let mut optimizer = AdamW::new(config)?;

// Create the learning rate scheduler (warmup, then decay)
let scheduler_config = SchedulerConfig {
    num_warmup_steps: 1000,
    num_training_steps: 10_000,
};
let scheduler = LinearScheduler::new(scheduler_config);

// Training loop: `model`, `batch`, and `num_steps` come from your own
// training setup and are not part of this crate.
for step in 0..num_steps {
    // Forward pass
    let loss = model.forward(&batch)?;
    // Backward pass
    let gradients = loss.backward()?;
    // Update the learning rate from the schedule
    let lr = scheduler.get_lr(step);
    optimizer.set_lr(lr);
    // Apply the update and clear accumulated gradients
    optimizer.step(&mut model.parameters(), &gradients)?;
    optimizer.zero_grad();
}
```
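The exact curve produced by `LinearScheduler::get_lr` is not spelled out above; the common convention for transformer training is a linear warmup to the peak learning rate followed by a linear decay to zero. Below is a minimal sketch of that convention, an assumption rather than the crate's exact formula:

```rust
// Hypothetical sketch of the conventional linear warmup + linear decay
// schedule; the crate's own LinearScheduler may differ in detail.
fn linear_warmup_decay(step: usize, warmup: usize, total: usize, peak_lr: f64) -> f64 {
    if step < warmup {
        // Ramp linearly from 0 to peak_lr over the warmup steps.
        peak_lr * step as f64 / warmup as f64
    } else {
        // Decay linearly from peak_lr to 0 between `warmup` and `total`.
        let remaining = total.saturating_sub(step) as f64;
        let decay_span = total.saturating_sub(warmup).max(1) as f64;
        peak_lr * remaining / decay_span
    }
}
```

With `num_warmup_steps: 1000` and `num_training_steps: 10_000` as above, the learning rate would peak at step 1,000 and reach zero at step 10,000.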
For large models the optimizer state itself can dominate memory. The `distributed` module provides a ZeRO-style optimizer wrapper: ZeRO-1 partitions optimizer state across data-parallel ranks, ZeRO-2 additionally partitions gradients, and ZeRO-3 additionally partitions the parameters themselves.

```rust
use trustformers_optim::{
    distributed::{ZeroOptimizer, ZeroConfig, ZeroStage},
    optimizers::AdamW,
};

// Configure ZeRO stage 3 (partition optimizer state, gradients, and parameters)
let zero_config = ZeroConfig {
    stage: ZeroStage::Three,
    partition_gradients: true,
    contiguous_gradients: true,
    overlap_comm: true,
    reduce_scatter: true,
    cpu_offload: false,
};

// Wrap a base optimizer with ZeRO; `adam_config`, `model`, and
// `process_group` come from your own setup.
let base_optimizer = AdamW::new(adam_config)?;
let optimizer = ZeroOptimizer::new(
    base_optimizer,
    model,
    zero_config,
    process_group,
)?;
```
Crate layout:

```text
trustformers-optim/
├── src/
│   ├── optimizers/          # Optimizer implementations
│   │   ├── sgd.rs           # SGD optimizer
│   │   ├── adam.rs          # Adam & AdamW
│   │   ├── lamb.rs          # LAMB optimizer
│   │   └── adafactor.rs     # AdaFactor optimizer
│   ├── schedulers/          # Learning rate schedulers
│   ├── distributed/         # Distributed optimization
│   │   ├── zero.rs          # ZeRO implementation
│   │   └── utils.rs         # Communication utilities
│   └── traits.rs            # Core traits (sketched below)
```
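The quick-start example calls `set_lr`, `step`, and `zero_grad` through the `Optimizer` trait exported at the crate root. The following is a hypothetical sketch of what that trait might look like, inferred only from that usage; the actual definition in `traits.rs` may differ, and `Tensor` / `OptimError` are placeholder types, not the crate's real ones:

```rust
// Placeholder types standing in for the crate's tensor and error types.
pub struct Tensor;
pub struct OptimError;

// Hypothetical shape of the core optimizer trait, inferred from usage.
pub trait Optimizer {
    /// Override the current learning rate (schedulers call this each step).
    fn set_lr(&mut self, lr: f64);
    /// Apply one parameter update given the corresponding gradients.
    fn step(&mut self, params: &mut [Tensor], grads: &[Tensor]) -> Result<(), OptimError>;
    /// Clear any gradient state accumulated by the optimizer.
    fn zero_grad(&mut self);
}
```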
Approximate training memory (parameters, gradients, and optimizer state; activations excluded) by ZeRO stage:

| Model Size | Standard | ZeRO-1 | ZeRO-2 | ZeRO-3 |
|---|---|---|---|---|
| 1.5B params | 24 GB | 16 GB | 12 GB | 8 GB |
| 7B params | 112 GB | 75 GB | 56 GB | 28 GB |
| 175B params | 2.8 TB | 1.9 TB | 1.4 TB | 700 GB |
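As a rough sanity check on the Standard column: mixed-precision Adam is commonly estimated at about 16 bytes of training state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), which reproduces the figures above. The ZeRO columns additionally depend on how many data-parallel ranks the state is partitioned across. A minimal sketch of the arithmetic:

```rust
/// Rough estimate for mixed-precision Adam:
/// 2 B fp16 weights + 2 B fp16 gradients
/// + 12 B fp32 master weights, momentum, and variance
/// = ~16 bytes per parameter (activations not included).
fn standard_adam_bytes(num_params: u64) -> u64 {
    num_params * 16
}

fn main() {
    let sizes: [(&str, u64); 3] = [
        ("1.5B", 1_500_000_000),
        ("7B", 7_000_000_000),
        ("175B", 175_000_000_000),
    ];
    for (name, params) in sizes {
        let gb = standard_adam_bytes(params) as f64 / 1e9;
        println!("{name}: ~{gb:.0} GB"); // 24 GB, 112 GB, 2800 GB (= 2.8 TB)
    }
}
```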
Recommended starting points:

- AdamW: lr = 5e-5, weight_decay = 0.01, warmup = 10% of steps
- LAMB: lr = 2e-3, weight_decay = 0.01, warmup = 10% of steps
- AdaFactor: lr = 1e-3, no weight decay, warmup = 10% of steps
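Using the configuration types from the quick-start example, the AdamW row translates to something like the sketch below (field names as shown earlier; values from the recommendations above; 10% warmup assumes the total step count is known up front):

```rust
use trustformers_optim::{optimizers::AdamWConfig, schedulers::SchedulerConfig};

// The AdamW recommendation expressed with the config types shown earlier.
let num_training_steps = 10_000;
let adamw = AdamWConfig {
    lr: 5e-5,
    betas: (0.9, 0.999),
    eps: 1e-8,
    weight_decay: 0.01,
    correct_bias: true,
};
let schedule = SchedulerConfig {
    num_warmup_steps: num_training_steps / 10, // 10% of steps
    num_training_steps,
};
```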
License: MIT OR Apache-2.0