verificar

Crates.ioverificar
lib.rsverificar
version0.5.0
created_at2025-11-25 17:04:02.111641+00
updated_at2025-11-30 21:37:57.275885+00
descriptionSynthetic Data Factory for Domain-Specific Code Intelligence
homepage
repositoryhttps://github.com/paiml/verificar
max_upload_size
id1950090
size1,477,234
Noah Gift (noahgift)

documentation

https://docs.rs/verificar

README

verificar

Crates.io Documentation License: MIT

Synthetic Data Factory for Domain-Specific Code Intelligence

Verificar is a unified combinatorial test generation and synthetic data factory for PAIML transpiler projects (depyler, bashrs, ruchy, decy). It generates verified (source, target, correctness) tuples at scale, creating training data for domain-specific code intelligence models.

Features

  • Multi-Language Support: Generate test programs in Python, Bash, C, TypeScript, and Ruchy
  • Combinatorial Generation: Exhaustive enumeration of valid programs up to configurable depth
  • Mutation Testing: AST-level mutation operators (AOR, ROR, LOR, BSR, etc.)
  • Verification Oracle: Sandboxed execution with I/O diffing for correctness verification
  • ML Pipeline: Bug prediction models and embeddings for code intelligence
  • Parquet Output: Efficient columnar storage for large-scale data processing

Architecture

┌─────────────────────────────────────────────────────────────┐
│                       VERIFICAR CORE                        │
├─────────────────────────────────────────────────────────────┤
│  Grammar    →   Generator   →   Mutator   →   Oracle       │
│  Definitions    Engine         Engine         Verification  │
└─────────────────────────────────────────────────────────────┘

Installation

Add to your Cargo.toml:

[dependencies]
verificar = "0.3"

Or with optional features:

[dependencies]
verificar = { version = "0.3", features = ["parquet", "ml"] }

Quick Start

Library Usage

use verificar::generator::{Generator, SamplingStrategy};
use verificar::Language;

// Create a generator for Python
let generator = Generator::new(Language::Python);

// Generate test cases using coverage-guided sampling
let strategy = SamplingStrategy::CoverageGuided {
    coverage_map: None,
    max_depth: 3,
    seed: 42,
};
let test_cases = generator.generate(strategy, 100);

CLI Usage

# Generate Python test programs
verificar generate --language python --count 1000 --output corpus.json

# Generate with specific sampling strategy
verificar generate --language bash --strategy swarm --count 500

# Generate depyler-specific patterns
verificar depyler --category file_io --count 100 --output depyler_tests/

Supported Languages

Language Grammar Description
Python PythonGrammar Functions, control flow, type hints (depyler source)
Bash BashGrammar Variables, pipes, conditionals (bashrs source)
C CGrammar Functions, pointers, memory operations (decy source)
TypeScript TypeScriptGrammar Interfaces, generics, type annotations (decy target)
Ruchy RuchyGrammar Custom DSL programs
Rust - Common target language

Sampling Strategies

  • Exhaustive: Enumerate all programs up to depth N
  • CoverageGuided: Prioritize unexplored AST paths (NAUTILUS-style)
  • Swarm: Random feature subsets per batch
  • Boundary: Edge values emphasized (0, -1, MAX_INT, empty collections)

Generation Priority

Based on organizational intelligence analysis of 1,296 defect-fix commits:

Priority Category Allocation Rationale
P0 ASTTransform 50% Universal dominant defect (40-62%)
P1 OwnershipBorrow 20% Rust-specific (15-20%)
P2 StdlibMapping 15% API translation errors
P3 Language-specific 15% bashrs security, decy memory, etc.

Features

Feature Description
parquet Enable Parquet data output
ml Enable ML pipeline (aprender integration)
tree-sitter Use tree-sitter for grammar parsing
pest Use pest for PEG grammars
full Enable all features

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please read the CLAUDE.md for development guidelines.

Commit count: 0

cargo fmt