| Crates.io | rusty-llm-jury |
| lib.rs | rusty-llm-jury |
| version | 0.1.0 |
| created_at | 2025-10-06 23:39:43.095175+00 |
| updated_at | 2025-10-06 23:39:43.095175+00 |
| description | A Rust CLI tool for estimating success rates when using LLM judges for evaluation |
| homepage | https://github.com/udapy/rusty-llm-jury |
| repository | https://github.com/udapy/rusty-llm-jury |
| max_upload_size | |
| id | 1870999 |
| size | 134,904 |
A Rust-based CLI tool for estimating success rates when using LLM judges for evaluation.
When using Large Language Models (LLMs) as judges to evaluate other models or systems, the judge's own biases and errors can significantly impact the reliability of the evaluation. rusty-llm-jury provides a command-line tool to estimate the true success rate of your system by correcting for LLM judge bias using bootstrap confidence intervals.
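For intuition: if the true pass rate is θ, a judge with true positive rate TPR and true negative rate TNR reports an observed pass rate of p_obs = θ × TPR + (1 - θ) × (1 - TNR). With θ = 0.6, TPR = 0.9, and TNR = 0.8, the judge reports 0.6 × 0.9 + 0.4 × 0.2 = 0.62 rather than 0.6; the estimator below inverts exactly this relationship.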
cargo install rusty-llm-jury
git clone https://github.com/udapy/rusty-llm-jury.git
cd rusty-llm-jury
cargo install --path .
# Estimate true success rate with bias correction
llm-jury estimate \
--test-labels "1,1,0,0,1,0,1,0" \
--test-preds "1,0,0,1,1,0,1,0" \
--unlabeled-preds "1,1,0,1,0,1,0,1" \
--bootstrap-iterations 20000 \
--confidence-level 0.95
# Output:
# Estimated true pass rate: 0.625
# 95% Confidence interval: [0.234, 0.891]
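The interval is the spread of the corrected estimate across bootstrap resamples; it is wide here because the example uses only eight values per input.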
# Load data from CSV files
llm-jury estimate \
--test-labels-file test_labels.csv \
--test-preds-file test_preds.csv \
--unlabeled-preds-file unlabeled_preds.csv
# Run TPR/TNR sensitivity analysis
llm-jury synth-experiment \
--true-failure-rate 0.1 \
--tpr-range 0.5,0.95 \
--tnr-range 0.5,0.95 \
--n-points 10 \
--output results.json
The tool implements a bias correction method: it estimates the judge's true positive rate (TPR) and true negative rate (TNR) on the labeled test set, then inverts the judge's bias on the unlabeled data via
θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1)
where p_obs is the observed pass rate from the judge on the unlabeled data.
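To make the formula concrete, here is a minimal, dependency-free sketch of the correction plus a percentile-bootstrap interval. It illustrates the math above and is not the crate's source; the toy TPR/TNR values and the hand-rolled xorshift resampler are assumptions, and the real tool may also propagate uncertainty from the labeled test set, which this sketch omits.

```rust
// Illustration of the correction above; not the crate's actual source.

fn rate(xs: &[u8]) -> f64 {
    xs.iter().map(|&x| x as f64).sum::<f64>() / xs.len() as f64
}

/// theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1), clamped to [0, 1].
fn corrected(p_obs: f64, tpr: f64, tnr: f64) -> f64 {
    ((p_obs + tnr - 1.0) / (tpr + tnr - 1.0)).clamp(0.0, 1.0)
}

fn main() {
    // Toy judge accuracy, as if measured on a labeled test set.
    let (tpr, tnr) = (0.9, 0.8);
    // Toy judge verdicts on the unlabeled target data: 80% observed passes.
    let unlabeled: Vec<u8> = (0..1000).map(|i| u8::from(i % 5 != 0)).collect();

    let theta = corrected(rate(&unlabeled), tpr, tnr);
    println!("theta_hat = {theta:.3}"); // (0.8 + 0.8 - 1) / (0.9 + 0.8 - 1) ≈ 0.857

    // Percentile bootstrap over the unlabeled verdicts, using xorshift64 so
    // the sketch stays dependency-free. The real tool may also resample the
    // labeled test set to propagate TPR/TNR uncertainty; this sketch does not.
    let mut s = 42u64;
    let mut next = move || { s ^= s << 13; s ^= s >> 7; s ^= s << 17; s };
    let mut draws: Vec<f64> = (0..20_000)
        .map(|_| {
            let sample: Vec<u8> = (0..unlabeled.len())
                .map(|_| unlabeled[(next() % unlabeled.len() as u64) as usize])
                .collect();
            corrected(rate(&sample), tpr, tnr)
        })
        .collect();
    draws.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = draws.len();
    println!("95% CI: [{:.3}, {:.3}]", draws[n * 25 / 1000], draws[n * 975 / 1000]);
}
```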
`llm-jury estimate`

Estimate true pass rate with bias correction and confidence intervals.

Options:

- `--test-labels <VALUES>`: Comma-separated 0/1 values (human labels on test set)
- `--test-preds <VALUES>`: Comma-separated 0/1 values (judge predictions on test set)
- `--unlabeled-preds <VALUES>`: Comma-separated 0/1 values (judge predictions on unlabeled data)
- `--test-labels-file <FILE>`: Load test labels from CSV file
- `--test-preds-file <FILE>`: Load test predictions from CSV file
- `--unlabeled-preds-file <FILE>`: Load unlabeled predictions from CSV file
- `--bootstrap-iterations <N>`: Number of bootstrap iterations (default: 20000)
- `--confidence-level <LEVEL>`: Confidence level between 0 and 1 (default: 0.95)
- `--output <FILE>`: Save results to JSON file
- `--format <FORMAT>`: Output format: text, json, csv (default: text)

`llm-jury synth-experiment`

Run synthetic sensitivity experiments; a sketch of a single grid point follows the options below.
Options:

- `--true-failure-rate <RATE>`: True failure rate in unlabeled data (default: 0.1)
- `--tpr-range <MIN,MAX>`: TPR range to test (default: 0.5,1.0)
- `--tnr-range <MIN,MAX>`: TNR range to test (default: 0.5,1.0)
- `--n-points <N>`: Number of points in each range (default: 10)
- `--n-test-positive <N>`: Number of positive test examples (default: 100)
- `--n-test-negative <N>`: Number of negative test examples (default: 100)
- `--n-unlabeled <N>`: Number of unlabeled samples (default: 1000)
- `--bootstrap-iterations <N>`: Bootstrap iterations (default: 2000)
- `--seed <SEED>`: Random seed for reproducibility
- `--output <FILE>`: Output file (JSON or CSV based on extension)
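As a sketch of what one grid point of this sweep involves (an interpretation of the options above, not the tool's actual implementation; the sample size and the simple RNG are assumptions): simulate judge verdicts at a given TPR/TNR, then check how well the correction recovers the true rate.

```rust
// Illustrative sketch of one TPR/TNR sensitivity sweep (not the tool's code).

/// Invert the judge bias: theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1).
fn corrected(p_obs: f64, tpr: f64, tnr: f64) -> f64 {
    ((p_obs + tnr - 1.0) / (tpr + tnr - 1.0)).clamp(0.0, 1.0)
}

fn main() {
    let theta = 1.0 - 0.1; // true pass rate, from --true-failure-rate 0.1
    let n_unlabeled = 1000; // --n-unlabeled default

    // xorshift64 uniform sampler so the sketch needs no dependencies.
    let mut s = 42u64;
    let mut uniform = move || {
        s ^= s << 13; s ^= s >> 7; s ^= s << 17;
        (s >> 11) as f64 / (1u64 << 53) as f64
    };

    // Grid over the default --tpr-range/--tnr-range of 0.5..1.0, --n-points 10.
    let grid: Vec<f64> = (0..10).map(|i| 0.5 + 0.5 * i as f64 / 9.0).collect();
    for &tpr in &grid {
        for &tnr in &grid {
            if tpr + tnr <= 1.0 {
                continue; // judge no better than chance: estimator undefined
            }
            // Simulate judge verdicts: passes judged correctly w.p. TPR,
            // failures falsely passed w.p. 1 - TNR.
            let mut passes = 0usize;
            for _ in 0..n_unlabeled {
                let truly_passes = uniform() < theta;
                let verdict = if truly_passes { uniform() < tpr } else { uniform() >= tnr };
                passes += verdict as usize;
            }
            let p_obs = passes as f64 / n_unlabeled as f64;
            let err = (corrected(p_obs, tpr, tnr) - theta).abs();
            println!("TPR={tpr:.2} TNR={tnr:.2} |error|={err:.4}");
        }
    }
}
```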
echo "1,0,1,1,0,0,1,0" > test_labels.csv # Human evaluation
echo "1,0,0,1,1,0,1,0" > test_preds.csv # LLM judge on same data
echo "1,1,0,1,0,1,0,1,1,0" > unlabeled.csv # LLM judge on target data
# Step 2: Estimate true success rate
llm-jury estimate \
--test-labels-file test_labels.csv \
--test-preds-file test_preds.csv \
--unlabeled-preds-file unlabeled.csv \
--format json \
--output results.json
# Step 3: View results
cat results.json
# Analyze how estimation varies with judge accuracy
llm-jury synth-experiment \
--true-failure-rate 0.2 \
--tpr-range 0.6,0.95 \
--tnr-range 0.6,0.95 \
--n-points 15 \
--seed 42 \
--output sensitivity_analysis.json
# Clone repository
git clone https://github.com/udapy/rusty-llm-jury.git
cd rusty-llm-jury
# Build release version
make build
# Or using cargo directly
cargo build --release
# The binary will be at target/release/llm-jury
# Format code
make fmt
# Run lints
make clippy
# Run tests
make test
# All checks
make check
Run the test suite:
cargo test
Run with coverage (requires cargo-tarpaulin):
cargo install cargo-tarpaulin
cargo tarpaulin --out html
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

This project is licensed under the MIT License - see the LICENSE file for details.
Note: This tool assumes that your LLM judge performs better than random chance (TPR + TNR > 1). If your judge's accuracy is too low, the correction method may not be applicable.
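For example, with TPR = 0.6 and TNR = 0.4 the denominator TPR + TNR - 1 is zero and the estimate is undefined: at that point p_obs = 1 - TNR regardless of the true pass rate, so the judge's verdicts carry no recoverable signal.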