| Crates.io | inference-lab |
| lib.rs | inference-lab |
| version | 0.5.0 |
| created_at | 2025-12-05 16:05:25.621652+00 |
| updated_at | 2026-01-24 16:38:21.016262+00 |
| description | High-performance LLM inference simulator for analyzing serving systems |
| homepage | |
| repository | https://github.com/doublewordai/inference-lab |
| max_upload_size | |
| id | 1968580 |
| size | 360,410 |
LLM inference simulator for analyzing serving systems. Simulates GPU clusters serving LLM inference workloads with realistic performance modeling.
inference-lab uses discrete-event simulation to model the behavior of a
multi-GPU node serving LLM inference requests with the vLLM library. It
contains a facsimile of the vLLM queueing, scheduling, and execution logic,
with only the actual model inference replaced by a performance model based on
the supplied GPU specs and model architecture.
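To make the performance model concrete: a common approach (and plausibly the one used here, though the exact formula may differ) is a roofline-style estimate in which each step takes the larger of a compute-bound time and a memory-bound time derived from the hardware fields shown in the configuration below. A minimal illustrative sketch, with hypothetical helper names:
// Illustrative roofline-style estimate of one decode step; not inference-lab's
// exact model, and the function name is hypothetical.
fn decode_step_seconds(
    batch_tokens: f64,     // tokens generated this step (one per running sequence)
    flops_per_token: f64,  // roughly 2 * num_parameters for a dense model
    weight_bytes: f64,     // num_parameters * bytes_per_param
    kv_bytes_read: f64,    // KV-cache bytes streamed for the batch
    compute_flops: f64,    // peak FLOP/s of the GPU (hardware.compute_flops)
    memory_bandwidth: f64, // peak bytes/s of the GPU (hardware.memory_bandwidth)
) -> f64 {
    let compute_time = batch_tokens * flops_per_token / compute_flops;
    let memory_time = (weight_bytes + kv_bytes_read) / memory_bandwidth;
    compute_time.max(memory_time) // the step is bound by the slower of the two
}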
Within each simulation step, the simulator:
Caveats:
cargo add inference-lab
npm install @doublewordai/inference-lab
cargo install inference-lab
Note: The CLI tool is only available if you install it using cargo install inference-lab (see above).
# Run with default configuration
inference-lab --config configs/config.toml
# Example output shows TTFT, E2E latency, throughput, and utilization metrics
use inference_lab::simulation::Simulator;
use inference_lab::config::SimulationConfig;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load simulation parameters from a TOML config and run the full simulation.
    let config = SimulationConfig::from_file("config.toml")?;
    let mut simulator = Simulator::new(config);
    let results = simulator.run();
    println!("Mean TTFT: {:.2}ms", results.ttft_mean * 1000.0);
    println!("P99 E2E: {:.2}ms", results.e2e_p99 * 1000.0);
    println!("Throughput: {:.1} tok/s", results.throughput);
    Ok(())
}
import init, { run_simulation } from '@doublewordai/inference-lab';
await init();
const config = {
hardware: {
name: "H100",
compute_flops: 1.513e15,
memory_bandwidth: 3.35e12,
memory_capacity: 85899345920,
bytes_per_param: 2
},
model: {
name: "Llama-3-70B",
num_parameters: 70000000000,
num_layers: 80,
hidden_dim: 8192,
num_heads: 64,
num_kv_heads: 8,
max_seq_len: 8192
},
scheduler: {
max_num_batched_tokens: 8192,
max_num_seqs: 256,
policy: "fcfs",
enable_chunked_prefill: true,
block_size: 16
},
workload: {
arrival_pattern: "poisson",
arrival_rate: 5.0,
num_requests: 400,
seed: 42,
input_len_dist: {
type: "lognormal",
mean: 6.9,
std_dev: 0.7
},
output_len_dist: {
type: "lognormal",
mean: 5.3,
std_dev: 0.8
}
}
};
const results = run_simulation(JSON.stringify(config));
console.log('TTFT P50:', results.metrics.ttft_p50);
console.log('Throughput:', results.metrics.output_tokens_per_sec);
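A note on the length distributions above: if mean and std_dev are read as the usual log-space parameters of a lognormal, the median input length works out to roughly exp(6.9) ≈ 1000 tokens and the median output length to roughly exp(5.3) ≈ 200 tokens; consult the crate's workload implementation if the exact interpretation matters for your experiment.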
Configuration files use TOML format and specify the hardware, model, scheduler, and workload parameters.
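As a rough guide, a TOML config mirroring the JSON example in the WebAssembly usage above might look like the sketch below; the section and key names are assumed to match that example, so treat the files in configs/ as the authoritative schema.
# Hypothetical sketch mirroring the JSON example above; see configs/config.toml for the real schema.
[hardware]
name = "H100"
compute_flops = 1.513e15
memory_bandwidth = 3.35e12
memory_capacity = 85899345920
bytes_per_param = 2

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 400
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

# [model], [scheduler], and [workload.output_len_dist] follow the same pattern.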
Example configurations are in the configs/ directory:
config.toml - Default H100 + Llama-3-70B setup
test_blog.toml - Closed-loop benchmark (64 users)
qwen3_30b_a3b.toml - Qwen model configuration
cargo build --release
./target/release/inference-lab --config configs/config.toml
npm run build
# Outputs to pkg/ directory
# Publish to npm (requires authentication)
npm run build
npm publish --access public
# Publish Rust crate
cargo publish
inference-lab/
├── src/
│ ├── simulation/ # Core simulator logic
│ ├── scheduler/ # Scheduling policies (FCFS, Priority, SJF)
│ ├── compute/ # Performance calculations
│ ├── kv_cache/ # KV cache management
│ ├── request/ # Request generation and tracking
│ ├── metrics/ # Performance metrics collection
│ ├── config/ # Configuration structures
│ ├── lib.rs # Library root
│ ├── main.rs # CLI entry point
│ └── wasm.rs # WebAssembly bindings
├── configs/ # Example configurations
├── Cargo.toml # Rust package manifest
└── package.json # npm package manifest
The simulator tracks TTFT (time to first token), end-to-end latency, token throughput, and utilization.
Results include percentiles (p50, p90, p95, p99) and means.
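Percentile figures like these are typically computed along the following lines (a nearest-rank percentile over the sorted per-request values); this is an illustrative sketch, not necessarily the crate's exact method.
// Nearest-rank percentile over per-request values (illustrative only).
fn percentile(values: &mut [f64], p: f64) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * values.len() as f64).ceil() as usize;
    Some(values[rank.clamp(1, values.len()) - 1])
}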
MIT