| Crates.io | hedl-c14n |
| lib.rs | hedl-c14n |
| version | 1.2.0 |
| created_at | 2026-01-08 11:01:40.216082+00 |
| updated_at | 2026-01-21 02:57:31.019486+00 |
| description | HEDL canonicalization and pretty-printing |
| homepage | https://dweve.com |
| repository | https://github.com/dweve/hedl |
| id | 2029987 |
| size | 258,407 |
Canonical form generation for HEDL documents—deterministic serialization with ditto optimization for minimal token count.
Comparing HEDL documents shouldn't fail on whitespace differences. Git diffs shouldn't show spurious changes from inconsistent formatting. LLM context windows are expensive—every token matters. Production systems need bit-for-bit identical outputs for cache hits and content-addressable storage. Cryptographic signatures require deterministic serialization.
hedl-c14n implements canonical form generation per SPEC.md Section 13.2. It transforms any valid HEDL document into a normalized form with consistent indentation, sorted keys, ditto optimization for repeated values, and count hints on matrix lists. The same document always produces identical output, and the result is round-trip stable: parse(canonicalize(doc)) preserves semantic equivalence. The ditto operator reduces token count by 15-40% while maintaining full type information.
Comprehensive canonicalization with performance and security:
Ditto optimization with the ^ operator (15-40% token reduction)
[count] annotations on matrix lists for fast parsing
Nulls written as ~, lowercase booleans
[dependencies]
hedl-c14n = "1.2"
use hedl_core::parse;
use hedl_c14n::canonicalize;
let doc = parse(br#"
%VERSION: 1.0
%STRUCT: User: [id, name, email]
---
users: @User
| alice, Alice Smith, alice@example.com
| bob, Bob Jones, bob@example.com
| charlie, Charlie Brown, charlie@example.com
"#)?;
let canonical = canonicalize(&doc)?;
println!("{}", canonical);
Output:
%VERSION: 1.0
%STRUCT: User: [id, name, email]
---
users: @User[3]
| alice, Alice Smith, alice@example.com
| bob, Bob Jones, bob@example.com
| charlie, Charlie Brown, charlie@example.com
Features Applied:
Count hint [3] added automatically
use hedl_c14n::{canonicalize_with_config, CanonicalConfig, QuotingStrategy};
let config = CanonicalConfig::builder()
.use_ditto(true) // Enable ditto optimization
.sort_keys(true) // Alphabetically sort fields
.inline_schemas(true) // Inline schemas in headers
.quoting(QuotingStrategy::Minimal) // Minimal quoting
.build();
let canonical = canonicalize_with_config(&doc, &config)?;
Replace repeated values with ^ to reduce token count:
orders: @Order[id, customer, status, priority]
| ord1, @User:alice, pending, high
| ord2, @User:alice, pending, high
| ord3, @User:bob, shipped, normal
| ord4, @User:bob, shipped, normal
| ord5, @User:bob, shipped, normal
orders: @Order[id, customer, status, priority]
| ord1, @User:alice, pending, high
| ord2, ^, ^, ^
| ord3, @User:bob, shipped, normal
| ord4, ^, ^, ^
| ord5, ^, ^, ^
Token Reduction: 33 fewer tokens (15% reduction for this example)
Ditto operator ^ applied when:
Type Equality Examples:
// These match (same type + value)
Value::String("alice") == Value::String("alice") // ✓ → ^
Value::Int(42) == Value::Int(42) // ✓ → ^
Value::Float(3.14) == Value::Float(3.14) // ✓ → ^
Value::Reference(qualified("User", "alice")) == Value::Reference(qualified("User", "alice")) // ✓ → ^
// These don't match (different types or values)
Value::String("42") != Value::Int(42) // ✗ → keep literal
Value::Float(0.0) != Value::Int(0) // ✗ → keep literal
Value::Reference(local("alice")) != Value::Reference(qualified("User", "alice")) // ✗ → keep literal
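The ditto rule and the type-equality examples above can be sketched in a few lines of standalone Rust. This is an illustration only, not the crate's implementation: the `Value` enum here is a simplified stand-in for the hedl-core type, `apply_ditto` and `render` are hypothetical names, and the choice to never ditto column 0 (the row ID) is an assumption of the sketch.

```rust
/// Simplified stand-in for a HEDL value (hypothetical; not the hedl-core type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    String(String),
    Reference(String),
}

/// Render one cell in a simplified textual form.
fn render(v: &Value) -> String {
    match v {
        Value::Int(i) => i.to_string(),
        Value::String(s) => s.clone(),
        Value::Reference(r) => format!("@{}", r),
    }
}

/// Replace a cell with `^` when it equals (same variant AND same value) the
/// cell directly above it in the previous row. Column 0 is treated as the
/// row ID and is never dittoed (an assumption of this sketch).
fn apply_ditto(rows: &[Vec<Value>]) -> Vec<Vec<String>> {
    let mut out = Vec::with_capacity(rows.len());
    for (i, row) in rows.iter().enumerate() {
        let rendered: Vec<String> = row
            .iter()
            .enumerate()
            .map(|(col, cell)| {
                // Compare against the ORIGINAL previous row, not its rendered
                // form, so a run of identical values keeps producing `^`.
                let same_as_above =
                    i > 0 && col > 0 && rows[i - 1].get(col) == Some(cell);
                if same_as_above {
                    "^".to_string()
                } else {
                    render(cell)
                }
            })
            .collect();
        out.push(rendered);
    }
    out
}
```

Because the derived `PartialEq` compares the enum variant before the payload, `Value::String("42")` and `Value::Int(42)` never match, which mirrors the "keep literal" cases listed above.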
Automatically generate [count] annotations:
// Input: no count hint
users: @User
| alice, Alice
| bob, Bob
// Output: count hint added
users: @User[2]
| alice, Alice
| bob, Bob
Benefits:
Algorithm: Recursive traversal counts nodes in each matrix list before serialization.
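A minimal sketch of the count-hint idea, assuming a simplified tree in which each row may carry nested matrix lists. `MatrixList`, `Row`, and `write_list` are hypothetical names, and the two-space indentation of nested lists is an assumption; the crate's real data model and layout may differ.

```rust
/// Hypothetical, simplified matrix list: a key, a type name, and rows.
struct MatrixList {
    key: String,
    type_name: String,
    rows: Vec<Row>,
}

/// A row with its cells and any nested child lists.
struct Row {
    cells: Vec<String>,
    children: Vec<MatrixList>,
}

/// Serialize a list, deriving the `[count]` hint from the number of rows and
/// recursing into nested lists so each one gets its own hint.
fn write_list(list: &MatrixList, indent: usize, out: &mut String) {
    let pad = "  ".repeat(indent);
    out.push_str(&format!(
        "{}{}: @{}[{}]\n",
        pad,
        list.key,
        list.type_name,
        list.rows.len()
    ));
    for row in &list.rows {
        out.push_str(&format!("{}| {}\n", pad, row.cells.join(", ")));
        for child in &row.children {
            write_list(child, indent + 1, out);
        }
    }
}
```

Since the row count is read from the in-memory list at the moment the header is written, no separate counting pass is needed in this sketch.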
All values normalized to canonical form:
// No trailing zeros
3.1400 → 3.14
5.000 → 5.0
// Whole numbers as floats (preserve type)
42.0 → 42.0
100.0 → 100.0
// Negative zero normalized
-0.0 → 0.0
// Special values
NaN → null // Not preserved (becomes null)
Infinity → null
-Infinity → null
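The float rules above can be approximated with the standard library alone. The helper name `canonical_float` is hypothetical, and the sketch leans on Rust's `{:?}` formatting for `f64`, which prints the shortest representation that round-trips and always keeps a fractional part for whole numbers. How the crate itself formats floats is not stated in this document.

```rust
/// Canonical text for a float, following the rules listed above.
/// Hypothetical helper; not part of the hedl-c14n API.
fn canonical_float(x: f64) -> String {
    // NaN, Infinity and -Infinity are not preserved: they become null.
    if !x.is_finite() {
        return "null".to_string();
    }
    // -0.0 == 0.0 is true under IEEE 754, so this maps both zeros to +0.0.
    let x = if x == 0.0 { 0.0 } else { x };
    // `{:?}` drops trailing zeros but keeps ".0" on whole numbers, which
    // preserves the float type in the output. Very large or very small
    // magnitudes may print in exponent form, a case the rules above do not
    // cover.
    format!("{:?}", x)
}
```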
# Canonical form uses tilde
field: ~
True → true
FALSE → false
# Qualified references
customer: @User:alice
# Local references
prev: @item1
Two quoting modes control string serialization:
Quote only when necessary:
# No quotes needed
name: Alice Smith
status: active
# Quotes required (contains special characters)
path: "C:\\Program Files"
note: "Hello, world" # Contains comma
value: "true" # Looks like boolean
id: "42" # Looks like integer
ref: "@alice" # Starts with @ (looks like reference)
Triggers for Quoting:
Special characters: : [ ] { } , | @
Leading - (looks like list marker)
Boolean literals: true, false
Null literals: null, ~
Numeric literals: 123, -456, 3.14
Quote all strings unconditionally:
name: "Alice Smith"
status: "active"
age: 30 # Numbers never quoted
active: true # Booleans never quoted
Use When: Maximum compatibility with naive parsers, explicit type marking
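The minimal-quoting triggers can be expressed as a single predicate. The following is a sketch under stated assumptions, not the crate's code: `needs_quotes` and `quote_minimal` are hypothetical names, `@` is treated as a trigger only at the start of a string (the canonical output above leaves `alice@example.com` unquoted), and escaping is limited to backslashes and double quotes.

```rust
/// Decide whether a string must be quoted under a Minimal-style strategy.
/// Hypothetical helper based on the triggers listed above.
fn needs_quotes(s: &str) -> bool {
    // Structural characters that would change how the line parses.
    const SPECIAL: &[char] = &[':', '[', ']', '{', '}', ',', '|'];
    s.contains(SPECIAL)
        // Leading `@` looks like a reference, leading `-` like a list marker.
        || s.starts_with('@')
        || s.starts_with('-')
        // Would otherwise be read back as a boolean or null.
        || matches!(s, "true" | "false" | "null" | "~")
        // Would otherwise be read back as a number. `parse::<f64>` is a broad
        // check (it also accepts forms such as "1e5" or "inf"), so this
        // sketch may quote more than strictly necessary.
        || s.parse::<f64>().is_ok()
}

/// Quote only when necessary, escaping backslashes and double quotes.
fn quote_minimal(s: &str) -> String {
    if needs_quotes(s) {
        format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
    } else {
        s.to_string()
    }
}
```

Over-quoting is the safe direction here: a quoted string always reads back as a string, while a missed trigger would silently change the value's type.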
Control field order with sort_keys:
config:
name: MyApp
version: 1.0
author: Alice
Preserves: Original insertion order from source document
config:
author: Alice
name: MyApp
version: 1.0
Benefits:
Note: Entity IDs always appear first in matrix rows regardless of sort_keys.
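As a rough illustration of the two orderings, the sketch below keeps fields as key/value pairs and sorts them only when asked. `order_fields` is a hypothetical helper, and the byte-wise comparison it uses (uppercase sorts before lowercase) is an assumption; this document does not specify the crate's collation.

```rust
/// Return fields either in their original insertion order or sorted by key.
/// Hypothetical helper; not part of the hedl-c14n API.
fn order_fields<'a>(
    mut fields: Vec<(&'a str, &'a str)>,
    sort_keys: bool,
) -> Vec<(&'a str, &'a str)> {
    if sort_keys {
        // Stable sort with byte-wise key comparison.
        fields.sort_by(|a, b| a.0.cmp(b.0));
    }
    fields
}
```

With the `config` example above, `sort_keys = false` yields name, version, author, and `sort_keys = true` yields author, name, version.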
Two modes for schema representation:
%VERSION: 1.0
%STRUCT: User: [id, name, email]
---
users: @User[2]
| alice, Alice, alice@example.com
| bob, Bob, bob@example.com
Advantages:
%VERSION: 1.0
---
users: @User[id, name, email][2]
| alice, Alice, alice@example.com
| bob, Bob, bob@example.com
Advantages:
use hedl_c14n::{CanonicalConfig, QuotingStrategy};
let config = CanonicalConfig::builder()
.use_ditto(true) // Enable ^ optimization (default: true)
.sort_keys(true) // Alphabetic sorting (default: true)
.inline_schemas(false) // Inline vs %STRUCT (default: false)
.quoting(QuotingStrategy::Minimal) // Quoting mode (default: Minimal)
.build();
use_ditto (default: true)
Replaces repeated values with the ^ operator
sort_keys (default: true)
inline_schemas (default: false)
true: Inline schemas in matrix headers @Type[field1, field2]
false: Separate %STRUCT declarations in header
quoting (default: Minimal)
QuotingStrategy::Minimal - Quote only when necessary
QuotingStrategy::Always - Quote all strings
Protection against deeply nested structures:
const MAX_NESTING_DEPTH: usize = 1000;
// Attempting to canonicalize > 1000 levels deep:
// Error: HedlError::Syntax { line: ..., message: "Max depth exceeded: 1001 levels (max: 1000)" }
Prevents:
Implementation: Depth counter incremented on each recursive call, decremented on return.
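The depth guard described above might look roughly like this. The `Node` type and `write_node` function are hypothetical, and the error is a plain `String` rather than `HedlError`. Passing `depth + 1` by value plays the role of the increment on call and decrement on return.

```rust
const MAX_NESTING_DEPTH: usize = 1000;

/// Hypothetical tree node standing in for a HEDL document node.
struct Node {
    key: String,
    children: Vec<Node>,
}

/// Serialize a node and its children, failing once the nesting depth exceeds
/// the limit. The root is written at depth 1.
fn write_node(node: &Node, depth: usize, out: &mut String) -> Result<(), String> {
    if depth > MAX_NESTING_DEPTH {
        return Err(format!(
            "Max depth exceeded: {} levels (max: {})",
            depth, MAX_NESTING_DEPTH
        ));
    }
    out.push_str(&"  ".repeat(depth.saturating_sub(1)));
    out.push_str(&node.key);
    out.push_str(":\n");
    for child in &node.children {
        // The callee sees depth + 1; the caller's `depth` is unchanged when
        // the call returns, so no explicit decrement is needed.
        write_node(child, depth + 1, out)?;
    }
    Ok(())
}
```

Checking the limit before writing anything for the offending level means the error surfaces before the recursion can grow the call stack any further.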
Canonicalization uses HedlError from hedl-core:
use hedl_c14n::canonicalize;
use hedl_core::HedlError;
match canonicalize(&doc) {
Ok(canonical) => println!("{}", canonical),
Err(HedlError::Syntax { line, message }) => {
eprintln!("Syntax error at line {}: {}", line, message);
}
Err(e) => {
eprintln!("Error: {}", e);
}
}
Errors include line numbers and context for debugging.
Canonical form preserves semantic equivalence:
use hedl_core::parse;
use hedl_c14n::canonicalize;
let original = parse(hedl_bytes)?;
let canonical_str = canonicalize(&original)?;
let reparsed = parse(canonical_str.as_bytes())?;
// Semantic equivalence holds
assert_eq!(original.version, reparsed.version);
assert_eq!(original.structs, reparsed.structs);
assert_eq!(original.entities, reparsed.entities);
Guarantees:
Non-Preserved:
Version Control Normalization: Canonicalize HEDL files before git commit to eliminate spurious formatting diffs. Enable clean git history focused on semantic changes.
LLM Context Optimization: Reduce token count by 15-40% through ditto optimization. Fit more data in 8K/32K/100K context windows without losing information.
Content-Addressable Storage: Generate deterministic hashes for identical documents regardless of source formatting. Enable deduplication and cache hits.
Cryptographic Signatures: Sign canonical form to ensure signatures verify regardless of whitespace/formatting variations. Ideal for document integrity verification.
Database Exports: Normalize exported HEDL for consistent baselines in testing and CI/CD. Detect actual data changes, not formatting noise.
Configuration Management: Standardize config file formatting across teams and tools. Automated formatting on save, consistent style enforcement.
Validation: Canonicalization assumes input is valid. For validation, use hedl-lint before canonicalization.
Comment Preservation: Comments are not part of canonical form and are stripped. For comment-preserving formatting, use hedl-core's pretty-printer.
Custom Formatting Rules: Configuration is comprehensive but not infinite. Highly custom formatting requirements may need custom serialization.
Schema Inference: Uses existing schemas from document. For schema generation from data, use hedl-core's inference APIs.
Time Complexity: O(n) where n = total nodes + fields. Single linear pass through document tree.
Space Complexity: O(n) output buffer + O(d) recursion stack where d = nesting depth. Pre-allocation optimization (P1) amortizes allocations.
Optimizations Implemented:
Ditto Performance: Each type-equality check is O(1), so enabling ditto adds minimal overhead (~1-2% slower than with ditto disabled).
Count Hint Generation: O(n) single pass to count nodes. Cached during serialization (no double-traversal).
hedl-core 1.0 - Core HEDL data structures and parsing
thiserror 1.0 - Error type definitions
Apache-2.0