llmnop

Crates.io: llmnop
lib.rs: llmnop
version: 0.6.0
created_at: 2025-07-10 04:38:21.684637+00
updated_at: 2026-01-19 07:46:54.275905+00
description: A command-line tool for benchmarking the performance of LLM inference endpoints.
homepage: https://github.com/jpreagan/llmnop
repository: https://github.com/jpreagan/llmnop
id: 1745875
size: 5,188,245
James Reagan (jpreagan)

README

llmnop

Installation | Quick Start | Metrics | Examples

llmnop is a fast, lightweight CLI that benchmarks LLM inference endpoints with detailed latency and throughput metrics.

It's a single binary with no dependencies: just download and run. Use it to compare inference providers, validate deployment performance, tune serving parameters, or establish baselines before and after changes.

Installation

Homebrew:

brew install jpreagan/tap/llmnop

Or with the shell installer:

curl -sSfL https://github.com/jpreagan/llmnop/releases/latest/download/llmnop-installer.sh | sh

The shell installer places llmnop in ~/.local/bin. Make sure that's on your PATH.
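
If ~/.local/bin isn't already on your PATH, add it in your shell profile. A minimal sketch for bash or zsh:

export PATH="$HOME/.local/bin:$PATH"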

Quick Start

llmnop --url http://localhost:8000/v1 \
  --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --mean-output-tokens 150

Results print to stdout and are saved to result_outputs/.

What It Measures

TTFT: Time to first token - how long until streaming begins
TTFO: Time to first output token - excludes reasoning/thinking tokens
Inter-token latency: Average gap between tokens during generation
Throughput: Tokens per second during the generation window
End-to-end latency: Total request time from start to finish

For reasoning models, TTFT includes thinking tokens. TTFO measures time until actual output begins, giving you the user-perceived latency.

Configuration

Required

--url: Base URL (e.g., http://localhost:8000/v1)
--api-key: API key for authentication
--model, -m: Model name to benchmark

Request Shaping

Control input and output token counts to simulate realistic workloads:

--mean-input-tokens (default: 550): Target prompt length in tokens
--stddev-input-tokens (default: 0): Add variance to input length
--mean-output-tokens (default: none): Cap output length (recommended for consistent benchmarks)
--stddev-output-tokens (default: 0): Add variance to output length
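
For example, a run that targets longer, more variable prompts with a capped output length might look like this (the token values are illustrative, not recommendations):

llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --mean-input-tokens 1000 --stddev-input-tokens 200 \
  --mean-output-tokens 300 --stddev-output-tokens 50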

Load Testing

--max-num-completed-requests (default: 10): Total requests to complete
--num-concurrent-requests (default: 1): Parallel request count
--timeout (default: 600): Request timeout in seconds
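
As a rough sketch, a heavier load test that also tightens the request timeout could combine these flags (the values are illustrative):

llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --num-concurrent-requests 8 \
  --max-num-completed-requests 200 \
  --timeout 120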

Tokenization

By default, llmnop uses a local Hugging Face tokenizer matching --model to count tokens.

--tokenizer: Use a different HF tokenizer (when model name doesn't match Hugging Face)
--use-server-token-count: Use server-reported usage instead of local tokenization

Use --use-server-token-count when you trust the server's token counts and want to avoid downloading tokenizer files. The server must return usage data or llmnop will error.
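
A minimal sketch of that mode against the same local endpoint used above:

llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --use-server-token-count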

Output

--api (default: chat): API type, chat or responses
--results-dir (default: result_outputs): Where to save JSON results
--no-progress (default: false): Hide progress bar (useful for CI)
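
For a CI run that writes results to a custom directory and hides the progress bar, something like the following should work (the directory name is just an example, and this assumes --no-progress is a bare switch):

llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --results-dir ci-benchmarks \
  --no-progress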

Examples

Load test with concurrency:

llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --num-concurrent-requests 10 \
  --max-num-completed-requests 100

Controlled benchmark with fixed output length:

llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --mean-output-tokens 150

Responses API:

llmnop --api responses --url http://localhost:8000/v1 --api-key token-abc123 \
  --model openai/gpt-oss-120b

Custom tokenizer when model name doesn't match Hugging Face:

llmnop --url http://localhost:11434/v1 --api-key ollama \
  --model gpt-oss:20b \
  --tokenizer openai/gpt-oss-20b

Cross-model comparison with neutral tokenizer:

When comparing different models, use a consistent tokenizer so token counts are comparable:

llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --tokenizer hf-internal-testing/llama-tokenizer

Output Files

Each run produces two JSON files in the results directory:

{model}_{input}_{output}_summary.json: Aggregated statistics with percentiles
{model}_{input}_{output}_individual_responses.json: Per-request timing data

The summary includes full statistical breakdowns (p25/p50/p75/p90/p95/p99, mean, min, max, stddev) for all metrics. Individual responses let you analyze distributions or identify outliers.
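
For a quick look at a run's aggregated metrics, you can pretty-print the summary file with a tool like jq (the glob assumes the default results directory):

jq . result_outputs/*_summary.json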

License

Apache License 2.0
