| Crates.io | inference-lab |
| lib.rs | inference-lab |
| version | 0.5.0 |
| created_at | 2025-12-05 16:05:25.621652+00 |
| updated_at | 2026-01-24 16:38:21.016262+00 |
| description | High-performance LLM inference simulator for analyzing serving systems |
| homepage | |
| repository | https://github.com/doublewordai/inference-lab |
| max_upload_size | |
| id | 1968580 |
| size | 360,410 |
LLM inference simulator for analyzing serving systems. Simulates GPU clusters serving LLM inference workloads with realistic performance modeling.
inference-lab uses discrete-event simulation to model the behavior of a
multi-GPU node serving LLM inference requests with the vLLM library. It
contains a facsimile of the vLLM queueing, scheduling, and execution logic,
with only the actual model inference replaced by a performance model based on
the supplied GPU specs and model architecture.
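To make the performance model concrete: a common approach (and plausibly the one used here, though the exact formula may differ) is a roofline-style estimate in which each step takes the larger of a compute-bound time and a memory-bound time derived from the hardware fields shown in the configuration below. A minimal illustrative sketch, with hypothetical helper names:
// Illustrative roofline-style estimate of one decode step; not inference-lab's
// exact model, and the function name is hypothetical.
fn decode_step_seconds(
    batch_tokens: f64,     // tokens generated this step (one per running sequence)
    flops_per_token: f64,  // roughly 2 * num_parameters for a dense model
    weight_bytes: f64,     // num_parameters * bytes_per_param
    kv_bytes_read: f64,    // KV-cache bytes streamed for the batch
    compute_flops: f64,    // peak FLOP/s of the GPU (hardware.compute_flops)
    memory_bandwidth: f64, // peak bytes/s of the GPU (hardware.memory_bandwidth)
) -> f64 {
    let compute_time = batch_tokens * flops_per_token / compute_flops;
    let memory_time = (weight_bytes + kv_bytes_read) / memory_bandwidth;
    compute_time.max(memory_time) // the step is bound by the slower of the two
}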
Within each simulation step, the simulator:
Caveats:
cargo add inference-lab
npm install @doublewordai/inference-lab
cargo install inference-lab
Note: The CLI tool is only available if you install it using cargo install inference-lab (see above).
# Run with default configuration
inference-lab --config configs/config.toml
# Example output shows TTFT, E2E latency, throughput, and utilization metrics
use inference_lab::simulation::Simulator;
use inference_lab::config::SimulationConfig;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load simulation parameters from a TOML config and run the full simulation.
    let config = SimulationConfig::from_file("config.toml")?;
    let mut simulator = Simulator::new(config);
    let results = simulator.run();
    println!("Mean TTFT: {:.2}ms", results.ttft_mean * 1000.0);
    println!("P99 E2E: {:.2}ms", results.e2e_p99 * 1000.0);
    println!("Throughput: {:.1} tok/s", results.throughput);
    Ok(())
}
import init, { run_simulation } from '@doublewordai/inference-lab';
await init();
const config = {
hardware: {
name: "H100",
compute_flops: 1.513e15,
memory_bandwidth: 3.35e12,
memory_capacity: 85899345920,
bytes_per_param: 2
},
model: {
name: "Llama-3-70B",
num_parameters: 70000000000,
num_layers: 80,
hidden_dim: 8192,
num_heads: 64,
num_kv_heads: 8,
max_seq_len: 8192
},
scheduler: {
max_num_batched_tokens: 8192,
max_num_seqs: 256,
policy: "fcfs",
enable_chunked_prefill: true,
block_size: 16
},
workload: {
arrival_pattern: "poisson",
arrival_rate: 5.0,
num_requests: 400,
seed: 42,
input_len_dist: {
type: "lognormal",
mean: 6.9,
std_dev: 0.7
},
output_len_dist: {
type: "lognormal",
mean: 5.3,
std_dev: 0.8
}
}
};
const results = run_simulation(JSON.stringify(config));
console.log('TTFT P50:', results.metrics.ttft_p50);
console.log('Throughput:', results.metrics.output_tokens_per_sec);
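A note on the length distributions above: if mean and std_dev are read as the usual log-space parameters of a lognormal, the median input length works out to roughly exp(6.9) ≈ 1000 tokens and the median output length to roughly exp(5.3) ≈ 200 tokens; consult the crate's workload implementation if the exact interpretation matters for your experiment.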
Configuration files use TOML format and specify the hardware, model, scheduler, and workload parameters.
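As a rough guide, a TOML config mirroring the JSON example in the WebAssembly usage above might look like the sketch below; the section and key names are assumed to match that example, so treat the files in configs/ as the authoritative schema.
# Hypothetical sketch mirroring the JSON example above; see configs/config.toml for the real schema.
[hardware]
name = "H100"
compute_flops = 1.513e15
memory_bandwidth = 3.35e12
memory_capacity = 85899345920
bytes_per_param = 2

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 400
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

# [model], [scheduler], and [workload.output_len_dist] follow the same pattern.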
Example configurations are in the configs/ directory:
config.toml - Default H100 + Llama-3-70B setup
test_blog.toml - Closed-loop benchmark (64 users)
qwen3_30b_a3b.toml - Qwen model configuration
cargo build --release
./target/release/inference-lab --config configs/config.toml
npm run build
# Outputs to pkg/ directory
# Publish to npm (requires authentication)
npm run build
npm publish --access public
# Publish Rust crate
cargo publish
inference-lab/
├── src/
│ ├── simulation/ # Core simulator logic
│ ├── scheduler/ # Scheduling policies (FCFS, Priority, SJF)
│ ├── compute/ # Performance calculations
│ ├── kv_cache/ # KV cache management
│ ├── request/ # Request generation and tracking
│ ├── metrics/ # Performance metrics collection
│ ├── config/ # Configuration structures
│ ├── lib.rs # Library root
│ ├── main.rs # CLI entry point
│ └── wasm.rs # WebAssembly bindings
├── configs/ # Example configurations
├── Cargo.toml # Rust package manifest
└── package.json # npm package manifest
The simulator tracks TTFT (time to first token), end-to-end latency, token throughput, and utilization.
Results include percentiles (p50, p90, p95, p99) and means.
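Percentile figures like these are typically computed along the following lines (a nearest-rank percentile over the sorted per-request values); this is an illustrative sketch, not necessarily the crate's exact method.
// Nearest-rank percentile over per-request values (illustrative only).
fn percentile(values: &mut [f64], p: f64) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * values.len() as f64).ceil() as usize;
    Some(values[rank.clamp(1, values.len()) - 1])
}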
MIT