| Field | Value |
|---|---|
| Crates.io | llmnop |
| lib.rs | llmnop |
| version | 0.6.0 |
| created_at | 2025-07-10 04:38:21.684637+00 |
| updated_at | 2026-01-19 07:46:54.275905+00 |
| description | A command-line tool for benchmarking the performance of LLM inference endpoints. |
| homepage | https://github.com/jpreagan/llmnop |
| repository | https://github.com/jpreagan/llmnop |
| max_upload_size | |
| id | 1745875 |
| size | 5,188,245 |
Installation | Quick Start | Metrics | Examples
llmnop is a fast, lightweight CLI that benchmarks LLM inference endpoints with detailed latency and throughput metrics.
It's a single binary with no dependencies: just download and run. Use it to compare inference providers, validate deployment performance, tune serving parameters, or establish baselines before and after changes.
Homebrew:

```sh
brew install jpreagan/tap/llmnop
```

Or with the shell installer:

```sh
curl -sSfL https://github.com/jpreagan/llmnop/releases/latest/download/llmnop-installer.sh | sh
```
The shell installer places llmnop in ~/.local/bin. Make sure that's on your PATH.
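If ~/.local/bin isn't already on your PATH, add it in your shell profile, for example:

```sh
# Make the installer's bin directory available in new shells (bash/zsh)
export PATH="$HOME/.local/bin:$PATH"
```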
Quick start:

```sh
llmnop --url http://localhost:8000/v1 \
  --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --mean-output-tokens 150
```
Results print to stdout and are saved to result_outputs/.

Metrics reported:
| Metric | Description |
|---|---|
| TTFT | Time to first token - how long until streaming begins |
| TTFO | Time to first output token - excludes reasoning/thinking tokens |
| Inter-token latency | Average gap between tokens during generation |
| Throughput | Tokens per second during the generation window |
| End-to-end latency | Total request time from start to finish |
For reasoning models, TTFT includes thinking tokens. TTFO measures time until actual output begins, giving you the user-perceived latency.
Core flags:

| Flag | Description |
|---|---|
| --url | Base URL (e.g., http://localhost:8000/v1) |
| --api-key | API key for authentication |
| --model, -m | Model name to benchmark |
Control input and output token counts to simulate realistic workloads:
| Flag | Default | Description |
|---|---|---|
| --mean-input-tokens | 550 | Target prompt length in tokens |
| --stddev-input-tokens | 0 | Add variance to input length |
| --mean-output-tokens | none | Cap output length (recommended for consistent benchmarks) |
| --stddev-output-tokens | 0 | Add variance to output length |
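For example, to randomize lengths around a target (values here are illustrative, not recommendations):

```sh
# Vary prompt and completion lengths around their means
llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --mean-input-tokens 550 --stddev-input-tokens 50 \
  --mean-output-tokens 150 --stddev-output-tokens 20
```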
Request volume, concurrency, and timeout:

| Flag | Default | Description |
|---|---|---|
| --max-num-completed-requests | 10 | Total requests to complete |
| --num-concurrent-requests | 1 | Parallel request count |
| --timeout | 600 | Request timeout in seconds |
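A sketch combining these (values are illustrative): 40 requests, four at a time, with a tighter timeout:

```sh
llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --max-num-completed-requests 40 \
  --num-concurrent-requests 4 \
  --timeout 120
```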
By default, llmnop uses a local Hugging Face tokenizer matching --model to count tokens.
| Flag | Description |
|---|---|
| --tokenizer | Use a different HF tokenizer (when model name doesn't match Hugging Face) |
| --use-server-token-count | Use server-reported usage instead of local tokenization |
Use --use-server-token-count when you trust the server's token counts and want to avoid downloading tokenizer files. The server must return usage data or llmnop will error.
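For instance, to skip local tokenization entirely (assuming the endpoint returns usage data):

```sh
# Rely on server-reported token counts; no tokenizer download needed
llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --use-server-token-count
```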
Output and API options:

| Flag | Default | Description |
|---|---|---|
| --api | chat | API type: chat or responses |
| --results-dir | result_outputs | Where to save JSON results |
| --no-progress | false | Hide progress bar (useful for CI) |
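For example, a CI-friendly run that writes results to a custom directory (the directory name is illustrative):

```sh
# No progress bar; results land in ./bench-results instead of result_outputs/
llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --no-progress \
  --results-dir bench-results
```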
Load test with concurrency:
```sh
llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --num-concurrent-requests 10 \
  --max-num-completed-requests 100
```
Controlled benchmark with fixed output length:
```sh
llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --mean-output-tokens 150
```
Responses API:
```sh
llmnop --api responses --url http://localhost:8000/v1 --api-key token-abc123 \
  --model openai/gpt-oss-120b
```
Custom tokenizer when model name doesn't match Hugging Face:
```sh
llmnop --url http://localhost:11434/v1 --api-key ollama \
  --model gpt-oss:20b \
  --tokenizer openai/gpt-oss-20b
```
Cross-model comparison with neutral tokenizer:
When comparing different models, use a consistent tokenizer so token counts are comparable:
```sh
llmnop --url http://localhost:8000/v1 --api-key token-abc123 \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --tokenizer hf-internal-testing/llama-tokenizer
```
Each run produces two JSON files in the results directory:
| File | Contents |
|---|---|
| {model}_{input}_{output}_summary.json | Aggregated statistics with percentiles |
| {model}_{input}_{output}_individual_responses.json | Per-request timing data |
The summary includes full statistical breakdowns (p25/p50/p75/p90/p95/p99, mean, min, max, stddev) for all metrics. Individual responses let you analyze distributions or identify outliers.
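A quick way to inspect a run, assuming jq is installed (field names depend on what llmnop writes):

```sh
# Pretty-print summary statistics for runs in the default results directory
jq . result_outputs/*_summary.json
```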