swarm-engine-eval

Version: 0.1.0
Created: 2026-01-25 17:51:37.230268+00
Updated: 2026-01-25 17:51:37.230268+00
Description: Evaluation framework for SwarmEngine
Repository: https://github.com/ytknishimura/swarm-engine
Yutaka Nishimura (ynishi)

Scenario-based evaluation framework for SwarmEngine agent swarms.

Usage

Running from CLI (Recommended)

# From project root
cargo run --package swarm-engine-ui -- eval <SCENARIO_PATH>

# Example: Troubleshooting scenario
cargo run --package swarm-engine-ui -- eval crates/swarm-engine-eval/scenarios/troubleshooting.toml

# With options:
#   -n 5        Number of runs (default: 1)
#   -v          Verbose output (show tick snapshots)
#   --learning  Enable learning data collection
cargo run --package swarm-engine-ui -- eval crates/swarm-engine-eval/scenarios/troubleshooting.toml \
    -n 5 \
    -v \
    --learning

CLI Options

Option               Description
-n, --runs <N>       Number of evaluation runs (default: 1)
-s, --seed <SEED>    Random seed (default: 42)
-o, --output <FILE>  JSON report output file
-v, --verbose        Verbose output with tick snapshots
--learning           Enable learning data collection
--variant <NAME>     Select scenario variant
--list-variants      List available variants
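
When eval runs are launched from a script, the options above map naturally onto an argument list. The following is a minimal sketch, not part of the crate: `EvalOptions` and `build_eval_command` are hypothetical helper names, and only the flags and defaults listed in the table are used.

```python
from dataclasses import dataclass
from typing import List, Optional

# Base command copied from the usage examples in this README.
BASE_CMD = ["cargo", "run", "--package", "swarm-engine-ui", "--", "eval"]


@dataclass
class EvalOptions:
    """Hypothetical container mirroring the documented CLI options."""
    runs: int = 1                  # -n, --runs (documented default: 1)
    seed: int = 42                 # -s, --seed (documented default: 42)
    output: Optional[str] = None   # -o, --output
    verbose: bool = False          # -v, --verbose
    learning: bool = False         # --learning
    variant: Optional[str] = None  # --variant


def build_eval_command(scenario_path: str, opts: EvalOptions) -> List[str]:
    """Return the argv list for one eval invocation (nothing is executed)."""
    cmd = BASE_CMD + [scenario_path, "--runs", str(opts.runs), "--seed", str(opts.seed)]
    if opts.output is not None:
        cmd += ["--output", opts.output]
    if opts.verbose:
        cmd.append("--verbose")
    if opts.learning:
        cmd.append("--learning")
    if opts.variant is not None:
        cmd += ["--variant", opts.variant]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the project root; building a list rather than a shell string avoids quoting problems with scenario paths.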

Scenarios

Built-in Scenarios

Located in the scenarios/ directory of this crate (crates/swarm-engine-eval/scenarios/ from the project root):

Scenario                 Description
troubleshooting.toml     Service diagnosis and recovery
code_exploration.toml    Codebase exploration
search.toml              Search tasks
internal_diagnosis.toml  Internal system diagnosis

Scenario Format

[meta]
name = "Service Troubleshooting"
id = "user:troubleshooting:v2"
version = "2.0.0"
description = "Diagnose and fix a service outage"
tags = ["troubleshooting", "diagnosis", "ops"]

[task]
goal = "Diagnose the failing service and restart it"
expected = "Worker successfully restarts the problematic service"

[task.context]
target_service = "user-service"
worker_count = 1

[llm]
provider = "llama-server"
model = "LFM2.5-1.2B"
endpoint = "http://localhost:8080"
temperature = 0.1
timeout_ms = 30000
max_tokens = 512

[manager]
process_every_tick = false
process_interval_ticks = 5
immediate_on_escalation = true
confidence_threshold = 0.3

[[actions.actions]]
name = "CheckStatus"
description = "Check the status of services"

[[actions.actions.params]]
name = "service"
description = "Optional: specific service name to check"
required = false

[[actions.actions]]
name = "ReadLogs"
description = "Read logs for a specific service"

[[actions.actions]]
name = "Diagnose"
description = "Diagnose the root cause of issues"

[[actions.actions]]
name = "Restart"
description = "Restart a service"
category = "node_state_change"

[app_config]
tick_duration_ms = 10
max_ticks = 150

Scenario Variants

Scenarios can define variants for different configurations:

# List variants
cargo run --package swarm-engine-ui -- eval troubleshooting.toml --list-variants

# Run with variant
cargo run --package swarm-engine-ui -- eval troubleshooting.toml --variant complex

Environment Types

Type             Description
troubleshooting  Service troubleshooting simulation
codebase         File operation environment (Read/Write/Grep/Glob)
none             Empty environment (for testing)

Learning Integration

The eval system integrates with the offline learning system:

# 1. Collect learning data
cargo run --package swarm-engine-ui -- eval troubleshooting.toml -n 30 --learning

# 2. Run offline learning
cargo run --package swarm-engine-ui -- learn once troubleshooting

# 3. Next eval will use learned parameters
cargo run --package swarm-engine-ui -- eval troubleshooting.toml -n 5 -v
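
The three steps above could be scripted so they always run in the same order. This is a rough sketch under stated assumptions: `learning_workflow` and `run_workflow` are hypothetical helpers, the scenario file and the `learn once` target are passed separately because the README does not say how one is derived from the other, and the run counts default to the values shown in the commands.

```python
import subprocess
from typing import List

# Common prefix of every command in the workflow above.
BASE = ["cargo", "run", "--package", "swarm-engine-ui", "--"]


def learning_workflow(
    scenario_file: str,
    learn_target: str,
    collect_runs: int = 30,
    verify_runs: int = 5,
) -> List[List[str]]:
    """Return the three argv lists: collect data, learn offline, re-evaluate."""
    return [
        BASE + ["eval", scenario_file, "-n", str(collect_runs), "--learning"],
        BASE + ["learn", "once", learn_target],
        BASE + ["eval", scenario_file, "-n", str(verify_runs), "-v"],
    ]


def run_workflow(steps: List[List[str]]) -> None:
    """Run each step in order, stopping at the first non-zero exit code."""
    for argv in steps:
        subprocess.run(argv, check=True)
```

For the example above, `run_workflow(learning_workflow("troubleshooting.toml", "troubleshooting"))` would issue the same three commands; `check=True` ensures the learning step is not run on top of a failed collection run.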

Assertions

Scenarios can define assertions for pass/fail criteria:

[[assertions]]
name = "minimum_success_rate"
metric = "success_rate"
op = "gte"
expected = 0.5

[[assertions]]
name = "max_ticks_limit"
metric = "total_ticks"
op = "lte"
expected = 100
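
To make the semantics concrete, each assertion can be read as "metric op expected". The sketch below evaluates the two assertions above against a set of metric values; it is an illustration of that reading, not the crate's implementation. Only `gte` and `lte` appear in this README, so no other operators are assumed, and the metric values in the usage note are made up for the example.

```python
import operator

# Only the two operators shown in the example above; others are unknown here.
OPS = {"gte": operator.ge, "lte": operator.le}

# The same assertions as in the TOML example, as plain dictionaries.
ASSERTIONS = [
    {"name": "minimum_success_rate", "metric": "success_rate", "op": "gte", "expected": 0.5},
    {"name": "max_ticks_limit", "metric": "total_ticks", "op": "lte", "expected": 100},
]


def check_assertions(assertions, metrics):
    """Return (assertion name, passed) pairs, in the order the assertions are given."""
    results = []
    for a in assertions:
        compare = OPS[a["op"]]           # raises KeyError for an unsupported op
        actual = metrics[a["metric"]]    # raises KeyError for a missing metric
        results.append((a["name"], compare(actual, a["expected"])))
    return results
```

With illustrative metrics such as `{"success_rate": 0.6, "total_ticks": 120}`, the first assertion passes (0.6 >= 0.5) and the second fails (120 is not <= 100).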

Output

Eval produces:

  • Console output with progress and results
  • JSON report (with -o option)
  • Learning data (with --learning option)
  • Tick snapshots in verbose mode (with -v option)