| Crates.io | data_generator |
| lib.rs | data_generator |
| version | 0.1.119 |
| created_at | 2025-10-07 09:00:13.991198+00 |
| updated_at | 2025-10-08 15:30:18.043842+00 |
| description | RDF data shapes implementation in Rust |
| homepage | https://rudof-project.github.io/rudof |
| repository | https://github.com/rudof-project/rudof |
| max_upload_size | |
| id | 1871366 |
| size | 405,749 |
A modern, configurable synthetic RDF data generator that creates realistic data conforming to ShEx or SHACL schemas.
You can use the following commands to test the application. Run them from the root folder of the rudof repository (for example, /home/diego/Documents/rudof/).
# Generate data from SHACL schema (auto-detected by .ttl extension)
cargo run -p data_generator -- --schema examples/simple_shacl.ttl --output shacl_data.ttl --entities 100
# Generate with specific seed for reproducible SHACL data
cargo run -p data_generator -- --schema examples/simple_shacl.ttl --output shacl_reproducible.ttl --entities 50 --seed 12345
# Generate from complex SHACL schema with more entities
cargo run -p data_generator -- --schema examples/shacl/node_shacl.ttl --output complex_shacl_data.ttl --entities 200
# Use parallel processing for large SHACL datasets
cargo run -p data_generator -- --schema examples/simple_shacl.ttl --output large_shacl_data.ttl --entities 5000 --parallel 8
# Generate data from ShEx schema (auto-detected by .shex extension)
cargo run -p data_generator -- --schema examples/simple.shex --output shex_data.ttl --entities 100
# Generate with configuration file and ShEx schema
cargo run -p data_generator -- --config data_generator/examples/simple_config.toml --schema data_generator/examples/schema.shex
# Generate with inline parameters using example ShEx schema
cargo run -p data_generator -- --schema data_generator/examples/schema.shex --output quick_shex_data.ttl --entities 100
# Generate with custom seed for reproducible ShEx results
cargo run -p data_generator -- --schema data_generator/examples/schema.shex --entities 50 --seed 12345
# Use automatic parallel configuration for medium datasets (works with both formats)
cargo run -p data_generator -- --config data_generator/examples/auto_parallel.toml --schema examples/simple_shacl.ttl
# Use high-performance parallel configuration for large datasets
cargo run -p data_generator -- --config data_generator/examples/parallel_config.toml --schema examples/simple.shex
# Show help for all options
cargo run -p data_generator -- --help
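The --seed option exists so that two runs with the same schema and settings yield the same data. The sketch below illustrates the idea with Python's standard random module. The name pools and the fake_names helper are hypothetical and are not part of data_generator.

```python
import random

# Hypothetical name pools; the generator's real vocabulary is not shown in this README.
FIRST_NAMES = ["Diana", "Fiona", "Carlos", "Mei"]
LAST_NAMES = ["Jones", "Rodriguez", "Okafor", "Lind"]


def fake_names(count, seed):
    """Return `count` pseudo-random full names drawn from a seeded RNG."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return [
        f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}" for _ in range(count)
    ]


# Two runs with the same seed produce the same sequence.
run_a = fake_names(5, seed=12345)
run_b = fake_names(5, seed=12345)
print(run_a == run_b)
```

Fixing the seed is mostly useful during development and in tests, where a diff between two output files should reflect a schema or configuration change rather than random noise.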
@prefix : <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:Person a sh:NodeShape ;
sh:closed true ;
sh:property [
sh:path :name ;
sh:minCount 1;
sh:maxCount 1;
sh:datatype xsd:string ;
] ;
sh:property [
sh:path :birthDate ;
sh:maxCount 1;
sh:datatype xsd:date ;
] ;
sh:property [
sh:path :enrolledIn ;
sh:node :Course ;
] .
:Course a sh:NodeShape;
sh:closed true ;
sh:property [
sh:path :name ;
sh:minCount 1;
sh:maxCount 1;
sh:datatype xsd:string ;
] .
From SHACL schema:
<http://example.org/Person-1> <http://example.org/name> "Diana Jones" ;
<http://example.org/enrolledIn> <http://example.org/Course-1> ;
<http://example.org/birthDate> "1971-03-12"^^<http://www.w3.org/2001/XMLSchema#date> ;
a <http://example.org/Person> .
<http://example.org/Course-1> <http://example.org/name> "Advanced Mathematics" ;
a <http://example.org/Course> .
From ShEx schema:
<http://example.org/Person-1> a <http://example.org/Person> ;
<http://example.org/name> "Fiona Rodriguez" .
<http://example.org/Course-1> a <http://example.org/Course> ;
<http://example.org/name> "Computer Science" .
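The sample output suggests that each entity IRI is the shape IRI followed by a hyphen and a running index. A minimal sketch of rendering one such entity as Turtle under that assumption follows. render_entity is a hypothetical helper, not the crate's writer, and object terms are passed in already serialized.

```python
def render_entity(shape_iri, index, properties):
    """Render one entity as a Turtle block.

    `properties` is a list of (predicate IRI, serialized object) pairs.
    Assumption: subject IRIs follow the "<shape>-<index>" pattern seen in the samples.
    """
    subject = f"<{shape_iri}-{index}>"
    lines = [f"{subject} a <{shape_iri}>"]
    for predicate, obj in properties:
        lines.append(f"    <{predicate}> {obj}")
    # Predicate-object pairs are separated by ";" and the block ends with ".".
    return " ;\n".join(lines) + " ."


print(
    render_entity(
        "http://example.org/Course",
        1,
        [("http://example.org/name", '"Computer Science"')],
    )
)
```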
# Copy the simple ready-to-use config
cp data_generator/examples/simple_config.toml my_config.toml
# Or copy the comprehensive example
cp data_generator/examples/config.toml my_config.toml
# For SHACL schemas (.ttl, .rdf, .nt files)
data_generator --config my_config.toml --schema your_schema.ttl
# For ShEx schemas (.shex files)
data_generator --config my_config.toml --schema your_schema.shex
# Auto-detection works - no need to specify format
data_generator --config my_config.toml --schema your_schema_file
# Generate data using configuration file (works with both ShEx and SHACL)
data_generator --config config.toml --schema schema_file
# Generate with inline parameters from SHACL schema
data_generator --schema schema.ttl --output data.ttl --entities 1000
# Generate with inline parameters from ShEx schema
data_generator --schema schema.shex --output data.ttl --entities 1000
# Generate with custom seed for reproducible results
data_generator --schema schema_file --entities 500 --seed 12345
# Use multiple threads for faster generation
data_generator --schema schema_file --entities 10000 --parallel 8
# Show help for all options
data_generator --help
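The commands above rely on the schema format being detected automatically. A minimal sketch of extension-based detection, using the extension lists from the comments above, could look like this. detect_schema_format is a hypothetical name, and the real tool may also inspect file content, which this sketch does not attempt.

```python
from pathlib import Path

# Extension table taken from the comments in the commands above.
SCHEMA_FORMATS = {
    ".ttl": "SHACL",
    ".rdf": "SHACL",
    ".nt": "SHACL",
    ".shex": "ShEx",
}


def detect_schema_format(path):
    """Guess the schema language from the file extension (case-insensitive)."""
    return SCHEMA_FORMATS.get(Path(path).suffix.lower(), "unknown")


for name in ["your_schema.ttl", "your_schema.shex", "your_schema_file"]:
    print(name, "->", detect_schema_format(name))
```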
See examples/config.toml for configuration options.
# Basic data generation settings
[generation]
entity_count = 1000 # Number of entities to generate
seed = 12345 # Random seed for reproducible results
entity_distribution = "Equal" # How to distribute entities across shapes
cardinality_strategy = "Balanced" # How to handle cardinalities
# Field generation settings
[field_generators.default]
locale = "en" # Locale for generated text
quality = "Medium" # Data quality level
# Output configuration
[output]
path = "generated_data.ttl" # Output file path
format = "Turtle" # Output format
compress = false # Whether to compress output
write_stats = true # Write generation statistics
# Parallel processing
[parallel]
worker_threads = 4 # Number of worker threads
batch_size = 100 # Entity batch size
parallel_shapes = true # Process shapes in parallel
parallel_fields = true # Generate fields in parallel
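The cardinality_strategy setting accepts "Minimum", "Maximum", "Random", or "Balanced" and decides how many values a property receives when the schema allows a range. The sketch below shows one possible reading of those four options. pick_cardinality is a hypothetical function, the "Balanced" rule (cycling through the range by entity index) is an assumption, and a finite maximum is assumed throughout.

```python
import random


def pick_cardinality(strategy, min_count, max_count, entity_index=0, rng=None):
    """Choose how many values to emit for a property allowing min_count..max_count."""
    if strategy == "Minimum":
        return min_count
    if strategy == "Maximum":
        return max_count
    if strategy == "Random":
        rng = rng or random.Random()
        return rng.randint(min_count, max_count)  # inclusive on both ends
    if strategy == "Balanced":
        # Assumption: cycle through the allowed range by entity index, which is
        # deterministic yet varied. The crate's actual rule is not documented here.
        return min_count + entity_index % (max_count - min_count + 1)
    raise ValueError(f"unknown cardinality strategy: {strategy}")


# A property with cardinality 0..2 across the first five entities.
print([pick_cardinality("Balanced", 0, 2, i) for i in range(5)])  # [0, 1, 2, 0, 1]
```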
# Advanced configuration with custom field generators
[generation]
entity_count = 5000
seed = 98765
entity_distribution = "Weighted"
cardinality_strategy = "Random"
# Weighted distribution for different shape types
[generation.distribution_weights]
"http://example.org/Person" = 0.5 # 50% persons
"http://example.org/Organization" = 0.3 # 30% organizations
"http://example.org/Course" = 0.2 # 20% courses
[field_generators.default]
locale = "en"
quality = "High"
# Custom integer generation with specific ranges
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#integer"]
generator = "integer"
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#integer".parameters]
min = 1
max = 10000
# Custom decimal generation
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#decimal"]
generator = "decimal"
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#decimal".parameters]
min = 0.0
max = 1000.0
precision = 2
# Custom date generation
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#date"]
generator = "date"
[field_generators.datatypes."http://www.w3.org/2001/XMLSchema#date".parameters]
start_year = 1980
end_year = 2024
# Property-specific generators
[field_generators.properties."http://example.org/name"]
generator = "string"
parameters = {}
[field_generators.properties."http://example.org/email"]
generator = "string"
[field_generators.properties."http://example.org/email".parameters]
templates = [
"{firstName}.{lastName}@{domain}",
"{firstName}{lastName}{number}@{domain}",
"info@{domain}",
"contact@{domain}"
]
[field_generators.properties."http://example.org/legalName"]
generator = "string"
parameters = {}
# Output with compression
[output]
path = "large_dataset.ttl.gz"
format = "Turtle"
compress = true
write_stats = true
# High-performance parallel settings
[parallel]
worker_threads = 8
batch_size = 250
parallel_shapes = true
parallel_fields = true
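With entity_distribution = "Weighted", the weights in generation.distribution_weights describe each shape's share of entity_count. How the generator rounds those shares is not stated here, so the sketch below uses the largest-remainder method as a stand-in. allocate_by_weight is a hypothetical name.

```python
import math


def allocate_by_weight(entity_count, weights):
    """Split entity_count across shapes in proportion to weights (largest remainder)."""
    total = sum(weights.values())
    exact = {shape: entity_count * w / total for shape, w in weights.items()}
    counts = {shape: math.floor(x) for shape, x in exact.items()}
    leftover = entity_count - sum(counts.values())
    # Hand the remaining entities to the shapes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda s: exact[s] - counts[s], reverse=True)
    for shape in by_remainder[:leftover]:
        counts[shape] += 1
    return counts


weights = {
    "http://example.org/Person": 0.5,
    "http://example.org/Organization": 0.3,
    "http://example.org/Course": 0.2,
}
print(allocate_by_weight(5000, weights))
```

For the configuration above this gives 2500 persons, 1500 organizations, and 1000 courses, and the per-shape counts always add up to entity_count.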
# Minimal configuration - uses defaults for most settings
[generation]
entity_count = 100
[output]
path = "simple_data.ttl"
[generation]
entity_count = 2000
entity_distribution = "Custom"
# Exact entity counts per shape
[generation.custom_counts]
"http://example.org/Person" = 1000
"http://example.org/Organization" = 200
"http://example.org/Course" = 800
[output]
path = "custom_distribution.ttl"
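In the "Custom" example the per-shape counts add up to entity_count (1000 + 200 + 800 = 2000). Whether the generator enforces this is not stated here, so a quick pre-flight check like the following sketch can catch a mismatch before a long run. count_mismatch is a hypothetical helper.

```python
def count_mismatch(entity_count, custom_counts):
    """Return entity_count minus the sum of per-shape counts (0 means consistent)."""
    return entity_count - sum(custom_counts.values())


custom_counts = {
    "http://example.org/Person": 1000,
    "http://example.org/Organization": 200,
    "http://example.org/Course": 800,
}
print(count_mismatch(2000, custom_counts))  # 0
```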
# Use TOML configuration with any schema format
data_generator --config config.toml --schema schema_file
# Use JSON configuration with SHACL schema
data_generator --config config.json --schema schema.ttl
# Use JSON configuration with ShEx schema
data_generator --config config.json --schema schema.shex
# Override config with command line (works with both formats)
data_generator --config config.toml --schema schema_file --entities 5000 --output override.ttl
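The last command shows command-line flags taking precedence over values from the configuration file. A minimal sketch of that precedence rule follows. The mapping of --entities to generation.entity_count, --output to output.path, and --seed to generation.seed is an assumption based on the option names, and merge_settings is a hypothetical helper.

```python
def merge_settings(config, cli_overrides):
    """Overlay CLI values on a config dict; None means the flag was not given."""
    # Copy each section so the original config is left untouched.
    merged = {section: dict(values) for section, values in config.items()}
    if cli_overrides.get("entities") is not None:
        merged.setdefault("generation", {})["entity_count"] = cli_overrides["entities"]
    if cli_overrides.get("seed") is not None:
        merged.setdefault("generation", {})["seed"] = cli_overrides["seed"]
    if cli_overrides.get("output") is not None:
        merged.setdefault("output", {})["path"] = cli_overrides["output"]
    return merged


config = {
    "generation": {"entity_count": 1000, "seed": 12345},
    "output": {"path": "generated_data.ttl"},
}
merged = merge_settings(
    config, {"entities": 5000, "output": "override.ttl", "seed": None}
)
print(merged)
```

Values that are not overridden, such as the seed here, keep whatever the configuration file specified.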
The data generator supports parallel writing to multiple files for improved I/O performance. The system can automatically detect the optimal number of files based on your dataset size and system capabilities.
Set parallel_file_count = 0 to enable automatic detection:
# Small dataset (50 entities) → automatically uses 1 file
cargo run --bin data_generator -- -c examples/small_auto.toml -s examples/schema_file
# Medium dataset (1000 entities) → automatically uses 8 files
cargo run --bin data_generator -- -c examples/auto_parallel.toml -s examples/schema_file
# Large dataset (5000 entities) → automatically uses 16 files
cargo run --bin data_generator -- -c examples/large_auto.toml -s examples/schema_file
[output]
path = "dataset.ttl"
format = "Turtle"
parallel_writing = true # Enable parallel writing
parallel_file_count = 8 # Write to 8 parallel files (manual setting)
[output]
path = "dataset.ttl"
format = "Turtle"
parallel_writing = true # Enable parallel writing
parallel_file_count = 0 # 0 = auto-detect optimal count
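With parallel writing enabled, the output for path = "dataset.ttl" is split into numbered part files, accompanied by a manifest and a combined statistics file. The sketch below derives those names from the configured path. parallel_output_names is a hypothetical helper, the three-digit zero padding follows the dataset_part_001.ttl pattern, and directory components of the path are ignored for brevity.

```python
from pathlib import Path


def parallel_output_names(path, file_count):
    """Derive part, manifest, and stats file names from the configured output path."""
    p = Path(path)
    parts = [f"{p.stem}_part_{i:03d}{p.suffix}" for i in range(1, file_count + 1)]
    return parts, f"{p.stem}.manifest.txt", f"{p.stem}.stats.json"


parts, manifest, stats = parallel_output_names("dataset.ttl", 8)
print(parts[0], parts[-1], manifest, stats)
```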
Auto-detection algorithm:
Output files:
dataset_part_001.ttl, dataset_part_002.ttl, etc.
dataset.manifest.txt (lists all parallel files)
dataset.stats.json (combined statistics)
Performance benefits:
{
"generation": {
"entity_count": 1000,
"seed": 12345,
"entity_distribution": "Equal",
"cardinality_strategy": "Balanced"
},
"field_generators": {
"default": {
"locale": "en",
"quality": "Medium"
},
"datatypes": {
"http://www.w3.org/2001/XMLSchema#integer": {
"generator": "integer",
"parameters": {
"min": 1,
"max": 10000
}
},
"http://www.w3.org/2001/XMLSchema#string": {
"generator": "string",
"parameters": {}
}
},
"properties": {
"http://example.org/name": {
"generator": "string",
"parameters": {}
}
}
},
"output": {
"path": "generated_data.ttl",
"format": "Turtle",
"compress": false,
"write_stats": true
},
"parallel": {
"worker_threads": 4,
"batch_size": 100,
"parallel_shapes": true,
"parallel_fields": true
}
}
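Because a JSON configuration is plain data, it can be checked before it is handed to the generator. The sketch below validates a few enumerated settings against the values documented for this crate. config_problems is a hypothetical helper and does not reflect the crate's own validation in config/.

```python
import json

# Enumerated values as documented for each setting.
ALLOWED = {
    ("generation", "entity_distribution"): {"Equal", "Weighted", "Custom"},
    ("generation", "cardinality_strategy"): {"Minimum", "Maximum", "Random", "Balanced"},
    ("output", "format"): {"Turtle", "NTriples", "JSONLD", "RdfXml"},
}


def config_problems(config):
    """List human-readable problems found in a parsed configuration dict."""
    problems = []
    for (section, key), allowed in ALLOWED.items():
        value = config.get(section, {}).get(key)
        if value is not None and value not in allowed:
            problems.append(f"{section}.{key}: unexpected value {value!r}")
    count = config.get("generation", {}).get("entity_count")
    if not isinstance(count, int) or count <= 0:
        problems.append("generation.entity_count: expected a positive integer")
    return problems


config = json.loads(
    '{"generation": {"entity_count": 1000, "entity_distribution": "Equal"},'
    ' "output": {"format": "Turtle"}}'
)
print(config_problems(config))  # []
```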
entity_count: Total number of entities to generate
seed: Random seed for reproducible results (optional)
entity_distribution: How to distribute entities across shapes
  "Equal": Equal distribution across all shapes
  "Weighted": Use weights to control distribution
  "Custom": Specify exact counts per shape
cardinality_strategy: How to handle property cardinalities
  "Minimum": Use minimum cardinality values
  "Maximum": Use maximum cardinality values
  "Random": Random values within cardinality range
  "Balanced": Deterministic but varied distribution

locale: Language/locale for generated text ("en", "es", "fr")
quality: Data quality level ("Low", "Medium", "High")
datatypes: Custom generators for specific XSD datatypes
properties: Custom generators for specific properties

path: Output file path
format: Output format ("Turtle", "NTriples", "JSONLD", "RdfXml")
compress: Whether to compress output file
write_stats: Include generation statistics
parallel_writing: Enable writing to multiple parallel files for better I/O performance
parallel_file_count: Number of parallel files (0 = auto-detect optimal count)

worker_threads: Number of parallel worker threads
batch_size: Entity batch size for processing
parallel_shapes: Process different shapes in parallel
parallel_fields: Generate field values in parallel

seed value for reproducible results during development
worker_threads for large datasets
parallel_writing = true and parallel_file_count = 0 for automatic optimization

When you run the generator with write_stats = true, you'll get:
generated_data.ttl: The actual RDF data in your chosen format
generated_data.stats.json: Generation statistics including:
Example statistics:
{
"total_triples": 15248,
"generation_time": "497ms",
"shape_counts": {
"http://example.org/Person": 334,
"http://example.org/Organization": 333,
"http://example.org/Course": 333
}
}
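The statistics file is ordinary JSON, so it is easy to post-process. The sketch below parses the example above and derives the total entity count and the average number of triples per entity. summarize_stats is a hypothetical helper, and treating the sum of shape_counts as the total entity count is an assumption.

```python
import json

STATS_JSON = """
{
  "total_triples": 15248,
  "generation_time": "497ms",
  "shape_counts": {
    "http://example.org/Person": 334,
    "http://example.org/Organization": 333,
    "http://example.org/Course": 333
  }
}
"""


def summarize_stats(stats):
    """Return (total entities, average triples per entity) from a stats dict."""
    total_entities = sum(stats["shape_counts"].values())
    return total_entities, stats["total_triples"] / total_entities


total, per_entity = summarize_stats(json.loads(STATS_JSON))
print(total, round(per_entity, 2))  # 1000 15.25
```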
The generator is built with a modular, functional architecture:
config/: Configuration management and validation
field_generators/: Composable field value generators
shape_processing/: ShEx schema parsing and analysis
parallel_generation/: Parallel data generation engine
output/: Multiple format output writers