| Crates.io | miyabi-benchmark |
| lib.rs | miyabi-benchmark |
| version | 0.1.2 |
| created_at | 2025-11-22 09:35:09.180982+00 |
| updated_at | 2025-11-22 09:35:09.180982+00 |
| description | Benchmark evaluation for Miyabi - SWE-bench Pro, AgentBench, HAL, Galileo |
| homepage | https://github.com/ShunsukeHayashi/Miyabi |
| repository | https://github.com/ShunsukeHayashi/Miyabi |
| max_upload_size | |
| id | 1945110 |
| size | 176,284 |
World-standard benchmark evaluation framework for the Miyabi AI development platform.
miyabi-benchmark provides a comprehensive evaluation framework for benchmarking Miyabi's autonomous development capabilities against world-standard datasets. It supports parallel evaluation, detailed reporting, and integration with the Miyabi worktree system for isolated execution.
Supported Benchmarks: SWE-bench Pro, AgentBench, HAL, and Galileo (each is described in its own section below).
Key Capabilities: parallel evaluation with configurable concurrency and timeouts, filtering by language or repository (e.g. django/django), isolated per-instance execution via the Miyabi worktree system, and JSON/HTML report generation.
Add to your Cargo.toml:
[dependencies]
miyabi-benchmark = "0.1.2"
Or install the CLI tool:
cargo install miyabi-benchmark --features cli
use miyabi_benchmark::{
dataset::SWEBenchDataset,
evaluator::SWEBenchProEvaluator,
reporter::EvaluationReporter,
};
use anyhow::Result;
#[tokio::main]
async fn main() -> Result<()> {
// 1. Load dataset
let dataset = SWEBenchDataset::load_from_json("swebench_pro_test.json")?;
println!("Loaded {} instances", dataset.len());
// 2. Filter by language (optional)
let python_instances = dataset.filter_by_language("python");
println!("Python instances: {}", python_instances.len());
// 3. Create evaluator
let evaluator = SWEBenchProEvaluator::new()?;
// 4. Run evaluation (parallel)
let results = evaluator.evaluate_all(&python_instances).await?;
// 5. Generate report
let reporter = EvaluationReporter::new();
let report = reporter.generate_report(&results);
println!("Success rate: {:.2}%", report.success_rate * 100.0);
println!("Total duration: {:.2}s", report.total_duration_secs);
// 6. Save results
reporter.save_to_json(&results, "evaluation_results.json")?;
Ok(())
}
# Download SWE-bench Pro dataset
miyabi-benchmark download-dataset --benchmark swe-bench-pro
# Run evaluation on all instances
miyabi-benchmark evaluate --dataset swebench_pro_test.json --output results.json
# Run with custom config
miyabi-benchmark evaluate \
--dataset swebench_pro_test.json \
--output results.json \
--concurrency 10 \
--timeout 3600 \
--model miyabi-v1.0.0
# Filter by language
miyabi-benchmark evaluate \
--dataset swebench_pro_test.json \
--language python \
--output python_results.json
# Filter by repository
miyabi-benchmark evaluate \
--dataset swebench_pro_test.json \
--repo django/django \
--output django_results.json
# Generate report from existing results
miyabi-benchmark report \
--input results.json \
--output report.html \
--format html
SWE-bench Pro
Dataset: 731 software engineering task instances from popular open-source projects
Format:
{
"instance_id": "django__django-12345",
"repo": "django/django",
"version": "3.2",
"problem_statement": "Fix bug in QuerySet.filter()...",
"hints_text": "Check the SQL generation logic...",
"test_patch": "diff --git a/tests/...",
"patch": "diff --git a/django/db/..."
}
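For reference, the following is a minimal sketch of how an instance with this shape could be deserialized with serde; the struct name, field types, and the optionality of hints_text are assumptions based on the JSON above, not the crate's published types:

use serde::Deserialize;

// Hypothetical mirror of the instance JSON shown above; the actual
// miyabi-benchmark types may differ in names and optional fields.
#[derive(Debug, Deserialize)]
struct SweBenchInstance {
    instance_id: String,        // e.g. "django__django-12345"
    repo: String,               // e.g. "django/django"
    version: String,            // e.g. "3.2"
    problem_statement: String,  // natural-language bug description
    hints_text: Option<String>, // optional hints for the agent
    test_patch: String,         // tests that must pass after the fix
    patch: String,              // reference (gold) patch
}

fn parse_instance(json: &str) -> serde_json::Result<SweBenchInstance> {
    serde_json::from_str(json)
}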
Evaluation Metrics: pass@1 (success rate), average duration per instance, average tokens per fix, and cost per fix (see the example report below)
AgentBench
Dataset: 8 environments covering diverse agent capabilities
Environments: Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles, House-Holding, Web Shopping, and Web Browsing
HAL
Dataset: Cost-efficient holistic evaluation across 9 benchmarks
Benchmarks:
Focus: Optimize for cost-per-token while maintaining accuracy
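As a rough worked example of the cost-per-token framing, the sketch below derives an implied per-million-token price from the per-fix figures shown in the example report further down; the numbers are illustrative only, not measured HAL results:

fn main() {
    // Figures taken from the example report below.
    let avg_tokens_per_fix = 12_500.0_f64;
    let cost_per_fix_usd = 0.05_f64;

    // Implied blended price per million tokens under those figures.
    let usd_per_million_tokens = cost_per_fix_usd / avg_tokens_per_fix * 1_000_000.0;
    println!("~${usd_per_million_tokens:.2} per million tokens"); // prints ~$4.00
}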
Galileo
Dataset: Enterprise-grade evaluation for 5 industries
Industries:
Metrics: Accuracy, Latency, Cost, Safety, Compliance
┌────────────────────────────┐
│ SWEBenchDataset            │ ← Load & Filter
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ SWEBenchProEvaluator       │ ← Parallel Eval
│ - Concurrency: 5           │
│ - Timeout: 30 min          │
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ WorktreeManager            │ ← Isolated Execution
│ - Per-instance sandbox     │
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ CoordinatorAgent           │ ← Generate Fix
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ Patch Generation           │ ← Unified Diff
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ Test Validation            │ ← Run Tests
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ EvaluationReporter         │ ← Generate Report
└────────────────────────────┘
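The concurrency and timeout annotations in the pipeline correspond to the CLI's --concurrency and --timeout options. The sketch below shows where such settings would plug in programmatically; only SWEBenchProEvaluator::new() is taken from the quick-start above, and the commented-out setters are assumptions rather than the published API:

use miyabi_benchmark::evaluator::SWEBenchProEvaluator;
use anyhow::Result;

// Hypothetical configuration point mirroring the diagram's defaults
// (concurrency 5, 30-minute timeout); the real crate API may differ.
fn build_evaluator() -> Result<SWEBenchProEvaluator> {
    let evaluator = SWEBenchProEvaluator::new()?;
    // If the crate mirrors the CLI flags, settings along these lines may exist
    // (assumed method names, not confirmed by the documentation):
    // evaluator.set_concurrency(5);
    // evaluator.set_timeout_secs(30 * 60);
    Ok(evaluator)
}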
Example evaluation report:
{
"model": "miyabi-v1.0.0",
"benchmark": "swe-bench-pro",
"total_instances": 731,
"successful": 584,
"failed": 147,
"success_rate": 0.799,
"avg_duration_secs": 245.3,
"total_duration_secs": 179353.0,
"metrics": {
"pass@1": 0.799,
"avg_tokens_per_fix": 12500,
"cost_per_fix_usd": 0.05
}
}
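A minimal sketch of reading a saved report back for post-processing; it uses serde_json's untyped Value so it does not rely on the crate's internal report types, and the field names are taken from the example above:

use anyhow::Result;
use serde_json::Value;

fn summarize_report(path: &str) -> Result<()> {
    // Parse the saved report into an untyped JSON value.
    let report: Value = serde_json::from_str(&std::fs::read_to_string(path)?)?;

    let total = report["total_instances"].as_u64().unwrap_or(0);
    let success_rate = report["success_rate"].as_f64().unwrap_or(0.0);
    let pass_at_1 = report["metrics"]["pass@1"].as_f64().unwrap_or(0.0);

    println!("{} instances, {:.1}% solved (pass@1 {:.3})",
             total, success_rate * 100.0, pass_at_1);
    Ok(())
}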
# Run all tests
cargo test --package miyabi-benchmark
# Run evaluator tests
cargo test --package miyabi-benchmark evaluator
# Run dataset tests
cargo test --package miyabi-benchmark dataset
# Integration tests (requires dataset)
cargo test --package miyabi-benchmark --test integration -- --ignored
Dependencies:
- miyabi-types, miyabi-core, miyabi-agents, miyabi-worktree
- tokio, async-trait
- serde, serde_json
- reqwest (for HuggingFace API)
- clap, indicatif (optional, feature-gated)
- anyhow, thiserror, chrono, tracing

Related Crates:
- miyabi-agents - Agent implementations for evaluation
- miyabi-worktree - Isolated execution environment
- miyabi-types - Shared type definitions
- miyabi-core - Core utilities

Submit your results to the official leaderboards for SWE-bench Pro, AgentBench, HAL, and Galileo.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Licensed under the MIT License. See LICENSE for details.
Part of the Miyabi Framework - Autonomous AI Development Platform