miyabi-benchmark

Crates.io: miyabi-benchmark · lib.rs: miyabi-benchmark
version: 0.1.2
created_at: 2025-11-22 09:35:09 UTC
updated_at: 2025-11-22 09:35:09 UTC
description: Benchmark evaluation for Miyabi - SWE-bench Pro, AgentBench, HAL, Galileo
homepage: https://github.com/ShunsukeHayashi/Miyabi
repository: https://github.com/ShunsukeHayashi/Miyabi
size: 176,284
owner: Shunsuke Hayashi (ShunsukeHayashi)

README

miyabi-benchmark

World-standard benchmark evaluation framework for the Miyabi AI development platform.

(badges: Crates.io · Documentation · License)

📋 Overview

miyabi-benchmark provides a comprehensive evaluation framework for benchmarking Miyabi's autonomous development capabilities against world-standard datasets. It supports parallel evaluation, detailed reporting, and integration with the Miyabi worktree system for isolated execution.

Supported Benchmarks:

  • ๐Ÿ† SWE-bench Pro (ScaleAI) - 731 software engineering task instances
  • ๐Ÿค– AgentBench (THUDM) - 8 agent capability environments
  • ๐Ÿ“Š HAL (Princeton) - Cost-efficient holistic evaluation across 9 benchmarks
  • ๐ŸŒŸ Galileo Agent Leaderboard v2 - Enterprise-grade evaluation for 5 industries

Key Capabilities:

  • 📦 Dataset Management: Load, filter, and manage benchmark datasets
  • ⚙️ Parallel Evaluation: Concurrent instance processing with configurable concurrency (sketched below)
  • 🔐 Isolated Execution: Git worktree-based sandboxing for each evaluation
  • ⏱️ Timeout Management: Configurable timeout per instance (default: 30 min)
  • 📈 Statistical Reporting: Success rate, duration, and performance metrics
  • 🎯 Patch Generation: Unified diff format for submission to official leaderboards
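
The concurrency limit and per-instance timeout can be reproduced with standard tokio primitives. The sketch below is illustrative only and does not show the crate's internal implementation; the Instance type and run_instance function are placeholders.

use std::{sync::Arc, time::Duration};
use tokio::{sync::Semaphore, time::timeout};

// Placeholder for a benchmark instance; the crate defines its own types.
#[derive(Clone)]
struct Instance {
    id: String,
}

// Placeholder for whatever actually evaluates one instance.
async fn run_instance(instance: Instance) -> Result<bool, String> {
    Ok(!instance.id.is_empty())
}

// Evaluate all instances with bounded concurrency and a per-instance timeout.
async fn evaluate_all(
    instances: Vec<Instance>,
    concurrency: usize,
    per_instance: Duration,
) -> Vec<Result<bool, String>> {
    let semaphore = Arc::new(Semaphore::new(concurrency));
    let mut handles = Vec::new();

    for instance in instances {
        let semaphore = Arc::clone(&semaphore);
        handles.push(tokio::spawn(async move {
            // Only `concurrency` evaluations run at the same time.
            let _permit = semaphore.acquire_owned().await.expect("semaphore closed");
            // Abort any instance that exceeds the timeout (default in the crate: 30 min).
            match timeout(per_instance, run_instance(instance)).await {
                Ok(result) => result,
                Err(_) => Err("timed out".to_string()),
            }
        }));
    }

    let mut results = Vec::new();
    for handle in handles {
        results.push(handle.await.expect("evaluation task panicked"));
    }
    results
}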

🚀 Features

SWE-bench Pro Support

  • Dataset Loading: Load from JSON (HuggingFace format)
  • Language Filtering: Filter by Python, JavaScript, TypeScript, Go, Rust, etc.
  • Repository Filtering: Focus on specific repos (e.g., django/django)
  • Patch Generation: Generate unified diffs for official evaluation
  • Test Validation: Run test suites to verify fixes

Evaluation Pipeline

  1. Setup: Create isolated worktree for each instance
  2. Execution: Run CoordinatorAgent to generate fix
  3. Patch: Generate unified diff patch
  4. Validation: Run tests to verify correctness
  5. Reporting: Collect metrics and generate report
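
The five steps map naturally onto one per-instance function. The helpers below (create_worktree, run_agent, generate_unified_diff, run_tests) are stubs standing in for the crate's internals, which are not documented in this README.

use anyhow::Result;
use std::{path::PathBuf, time::Instant};

// Stub types and helpers; the crate's real worktree/agent APIs are not shown here.
struct Worktree {
    path: PathBuf,
}

async fn create_worktree(_instance_id: &str) -> Result<Worktree> {
    Ok(Worktree { path: PathBuf::from(".") })
}
async fn run_agent(_worktree: &Worktree) -> Result<()> {
    Ok(())
}
async fn generate_unified_diff(_worktree: &Worktree) -> Result<String> {
    Ok(String::new())
}
async fn run_tests(_worktree: &Worktree) -> Result<bool> {
    Ok(true)
}

struct InstanceOutcome {
    instance_id: String,
    passed: bool,
    patch: String,
    duration_secs: f64,
}

async fn evaluate_one(instance_id: &str) -> Result<InstanceOutcome> {
    let started = Instant::now();

    // 1. Setup: isolated git worktree for this instance.
    let worktree = create_worktree(instance_id).await?;
    // 2. Execution: the coordinating agent attempts a fix inside the worktree.
    run_agent(&worktree).await?;
    // 3. Patch: capture the change as a unified diff.
    let patch = generate_unified_diff(&worktree).await?;
    // 4. Validation: run the instance's test suite against the patched tree.
    let passed = run_tests(&worktree).await?;
    // 5. Reporting: hand a structured result to the reporter.
    Ok(InstanceOutcome {
        instance_id: instance_id.to_string(),
        passed,
        patch,
        duration_secs: started.elapsed().as_secs_f64(),
    })
}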

Performance Tracking

  • Success Rate: Percentage of correctly fixed instances
  • Timing: Min, max, average, and total duration
  • Failure Analysis: Error categorization and debugging info
  • Comparison: Benchmark against state-of-the-art agents
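
These statistics can be derived directly from per-instance results. A minimal sketch, assuming a hypothetical InstanceResult type (the crate's real result type may differ):

// Hypothetical per-instance result; stands in for whatever the reporter consumes.
struct InstanceResult {
    passed: bool,
    duration_secs: f64,
}

struct Summary {
    success_rate: f64,
    min_secs: f64,
    max_secs: f64,
    avg_secs: f64,
    total_secs: f64,
}

// Assumes a non-empty result set.
fn summarize(results: &[InstanceResult]) -> Summary {
    let total_secs: f64 = results.iter().map(|r| r.duration_secs).sum();
    let successful = results.iter().filter(|r| r.passed).count();
    Summary {
        // Success rate: fraction of instances whose fix passed validation.
        success_rate: successful as f64 / results.len() as f64,
        min_secs: results.iter().map(|r| r.duration_secs).fold(f64::INFINITY, f64::min),
        max_secs: results.iter().map(|r| r.duration_secs).fold(f64::NEG_INFINITY, f64::max),
        avg_secs: total_secs / results.len() as f64,
        total_secs,
    }
}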

📦 Installation

Add to your Cargo.toml:

[dependencies]
miyabi-benchmark = "0.1.2"

Or install the CLI tool:

cargo install miyabi-benchmark --features cli

🔧 Usage

As a Library

use miyabi_benchmark::{
    dataset::SWEBenchDataset,
    evaluator::SWEBenchProEvaluator,
    reporter::EvaluationReporter,
};
use anyhow::Result;

#[tokio::main]
async fn main() -> Result<()> {
    // 1. Load dataset
    let dataset = SWEBenchDataset::load_from_json("swebench_pro_test.json")?;
    println!("Loaded {} instances", dataset.len());

    // 2. Filter by language (optional)
    let python_instances = dataset.filter_by_language("python");
    println!("Python instances: {}", python_instances.len());

    // 3. Create evaluator
    let evaluator = SWEBenchProEvaluator::new()?;

    // 4. Run evaluation (parallel)
    let results = evaluator.evaluate_all(&python_instances).await?;

    // 5. Generate report
    let reporter = EvaluationReporter::new();
    let report = reporter.generate_report(&results);

    println!("Success rate: {:.2}%", report.success_rate * 100.0);
    println!("Total duration: {:.2}s", report.total_duration_secs);

    // 6. Save results
    reporter.save_to_json(&results, "evaluation_results.json")?;

    Ok(())
}

As a CLI Tool

# Download SWE-bench Pro dataset
miyabi-benchmark download-dataset --benchmark swe-bench-pro

# Run evaluation on all instances
miyabi-benchmark evaluate --dataset swebench_pro_test.json --output results.json

# Run with custom config
miyabi-benchmark evaluate \
  --dataset swebench_pro_test.json \
  --output results.json \
  --concurrency 10 \
  --timeout 3600 \
  --model miyabi-v1.0.0

# Filter by language
miyabi-benchmark evaluate \
  --dataset swebench_pro_test.json \
  --language python \
  --output python_results.json

# Filter by repository
miyabi-benchmark evaluate \
  --dataset swebench_pro_test.json \
  --repo django/django \
  --output django_results.json

# Generate report from existing results
miyabi-benchmark report \
  --input results.json \
  --output report.html \
  --format html

📊 Benchmark Details

SWE-bench Pro (ScaleAI)

Dataset: 731 software engineering task instances from popular open-source projects

Format:

{
  "instance_id": "django__django-12345",
  "repo": "django/django",
  "version": "3.2",
  "problem_statement": "Fix bug in QuerySet.filter()...",
  "hints_text": "Check the SQL generation logic...",
  "test_patch": "diff --git a/tests/...",
  "patch": "diff --git a/django/db/..."
}
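
A serde mapping for one record might look like this; the struct name and field types are assumptions taken from the JSON above, not the crate's published types, and the file is assumed to be a JSON array of such records.

use serde::{Deserialize, Serialize};

// Assumed shape of a SWE-bench Pro record, mirroring the JSON fields shown above.
#[derive(Debug, Serialize, Deserialize)]
struct SweBenchProInstance {
    instance_id: String,
    repo: String,
    version: String,
    problem_statement: String,
    hints_text: String,
    test_patch: String,
    patch: String,
}

fn parse_instances(json: &str) -> serde_json::Result<Vec<SweBenchProInstance>> {
    serde_json::from_str(json)
}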

Evaluation Metrics:

  • Accuracy: Percentage of correctly fixed instances
  • Pass@1: Success rate on first attempt
  • Avg. Duration: Average time per instance
  • Token Efficiency: Tokens used per successful fix

AgentBench (THUDM)

Dataset: 8 environments covering diverse agent capabilities

Environments:

  1. OS Interaction: Shell commands, file operations
  2. Database Queries: SQL generation and execution
  3. Knowledge Graph: Entity/relation reasoning
  4. Digital Card Game: Multi-step planning
  5. Lateral Thinking: Creative problem-solving
  6. House-Holding: Common-sense reasoning
  7. Web Shopping: Web interaction and decision-making
  8. Web Browsing: Information retrieval
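
For reference, the eight environments could be modelled as a simple enum; this is a hypothetical representation, not the crate's API.

// Hypothetical mapping of the eight AgentBench environments; the crate may model these differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AgentBenchEnvironment {
    OsInteraction,
    DatabaseQueries,
    KnowledgeGraph,
    DigitalCardGame,
    LateralThinking,
    HouseHolding,
    WebShopping,
    WebBrowsing,
}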

HAL (Princeton)

Dataset: Cost-efficient holistic evaluation across 9 benchmarks

Benchmarks:

  • MMLU, GSM8K, HumanEval, MATH, DROP, HellaSwag, ARC, TruthfulQA, BigBench-Hard

Focus: Optimize for cost-per-token while maintaining accuracy
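
To make the cost focus concrete, the helper below shows the cost-per-fix arithmetic; the per-million-token price is a placeholder chosen only so the numbers line up with the example results later in this README, not an actual rate.

// Illustrative cost arithmetic; the $/1M-token price is a placeholder, not a real rate.
// e.g. 12,500 tokens per successful fix at $4.00 per 1M tokens ≈ $0.05 per fix.
fn cost_per_successful_fix(total_tokens: u64, usd_per_million_tokens: f64, successful_fixes: u64) -> f64 {
    let total_cost = total_tokens as f64 / 1_000_000.0 * usd_per_million_tokens;
    total_cost / successful_fixes as f64
}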

Galileo Agent Leaderboard v2

Dataset: Enterprise-grade evaluation for 5 industries

Industries:

  • Finance, Healthcare, Legal, E-commerce, Manufacturing

Metrics: Accuracy, Latency, Cost, Safety, Compliance

๐Ÿ—๏ธ Architecture

┌──────────────────────────┐
│  SWEBenchDataset         │ → Load & Filter
└──────────────────────────┘
             ↓
┌──────────────────────────┐
│  SWEBenchProEvaluator    │ → Parallel Eval
│  - Concurrency: 5        │
│  - Timeout: 30 min       │
└──────────────────────────┘
             ↓
┌──────────────────────────┐
│  WorktreeManager         │ → Isolated Execution
│  - Per-instance sandbox  │
└──────────────────────────┘
             ↓
┌──────────────────────────┐
│  CoordinatorAgent        │ → Generate Fix
└──────────────────────────┘
             ↓
┌──────────────────────────┐
│  Patch Generation        │ → Unified Diff
└──────────────────────────┘
             ↓
┌──────────────────────────┐
│  Test Validation         │ → Run Tests
└──────────────────────────┘
             ↓
┌──────────────────────────┐
│  EvaluationReporter      │ → Generate Report
└──────────────────────────┘

📈 Example Results

{
  "model": "miyabi-v1.0.0",
  "benchmark": "swe-bench-pro",
  "total_instances": 731,
  "successful": 584,
  "failed": 147,
  "success_rate": 0.799,
  "avg_duration_secs": 245.3,
  "total_duration_secs": 179353.0,
  "metrics": {
    "pass@1": 0.799,
    "avg_tokens_per_fix": 12500,
    "cost_per_fix_usd": 0.05
  }
}
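
Results files in this shape can be read back with serde_json (already a dependency of the crate); the struct below mirrors only the top-level fields of the example and is not the crate's own report type.

use serde::Deserialize;
use std::collections::HashMap;

// Mirrors the top-level fields of the example results above; not the crate's own report type.
#[derive(Debug, Deserialize)]
struct ResultsFile {
    model: String,
    benchmark: String,
    total_instances: u64,
    successful: u64,
    failed: u64,
    success_rate: f64,
    avg_duration_secs: f64,
    total_duration_secs: f64,
    // Nested metrics (pass@1, tokens, cost) kept as a loose map for flexibility.
    metrics: HashMap<String, f64>,
}

fn load_results(path: &str) -> anyhow::Result<ResultsFile> {
    let raw = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&raw)?)
}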

🧪 Testing

# Run all tests
cargo test --package miyabi-benchmark

# Run evaluator tests
cargo test --package miyabi-benchmark evaluator

# Run dataset tests
cargo test --package miyabi-benchmark dataset

# Integration tests (requires dataset)
cargo test --package miyabi-benchmark --test integration -- --ignored

🔗 Dependencies

  • Core: miyabi-types, miyabi-core, miyabi-agents, miyabi-worktree
  • Runtime: tokio, async-trait
  • Serialization: serde, serde_json
  • HTTP: reqwest (for HuggingFace API)
  • CLI: clap, indicatif (optional, feature-gated)
  • Utilities: anyhow, thiserror, chrono, tracing

📚 Related Crates

🎯 Official Leaderboards

Submit your results to official leaderboards:

๐Ÿค Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

📄 License

Licensed under the MIT License. See LICENSE for details.

🔖 Version History

  • v0.1.0 (2025-10-25): Initial release
    • SWE-bench Pro dataset loading and evaluation
    • Parallel evaluation with configurable concurrency
    • Worktree-based isolated execution
    • Detailed reporting and statistics
    • AgentBench, HAL, Galileo support (planned)

Part of the Miyabi Framework - Autonomous AI Development Platform
