hedl-c14n

Crates.io	hedl-c14n
lib.rs	hedl-c14n
version	1.2.0
created_at	2026-01-08 11:01:40.216082+00
updated_at	2026-01-21 02:57:31.019486+00
description	HEDL canonicalization and pretty-printing
homepage	https://dweve.com
repository	https://github.com/dweve/hedl
id	2029987
size	258,407

(marcflp)

documentation

https://docs.rs/hedl-c14n

README

hedl-c14n

Canonical form generation for HEDL documents: deterministic serialization with ditto optimization for minimal token count.

Comparing HEDL documents shouldn't fail on whitespace differences. Git diffs shouldn't show spurious changes from inconsistent formatting. LLM context windows are expensive, so every token matters. Production systems need bit-for-bit identical outputs for cache hits and content-addressable storage. Cryptographic signatures require deterministic serialization.

hedl-c14n implements canonical form generation per SPEC.md Section 13.2. It transforms any valid HEDL document into a normalized form with consistent indentation, optionally sorted keys, ditto optimization for repeated values, and count hints on matrix lists. The same document always produces identical output. Output is round-trip stable: parse(canonicalize(doc)) preserves semantic equivalence. The ditto operator reduces token count by 15-40% on repetitive data while keeping full type information.

What's Implemented

Canonicalization features, including performance optimizations and security hardening:

Deterministic Serialization: Bit-for-bit identical output for equivalent documents
Ditto Optimization: Replace repeated values with ^ operator (15-40% token reduction)
Count Hints: Automatic [count] annotations on matrix lists for fast parsing
Value Normalization: Float formatting (no trailing zeros, -0 → 0), null as ~, lowercase booleans
Key Ordering: Optional alphabetic sorting for consistent field ordering
Quoting Strategy: Minimal quoting (only when necessary) or always-quote modes
Schema Options: Inline schemas in matrix headers or separate %STRUCT declarations
Security Hardening: 1000-level depth limit prevents stack overflow
Performance Optimizations: Pre-allocated buffers (P1), direct BTreeMap iteration (P0)
Round-Trip Stability: Semantic equivalence preserved through parse → canonicalize → parse

Installation

[dependencies]
hedl-c14n = "1.2"

Basic Usage

Canonicalize with Defaults

use hedl_core::parse;
use hedl_c14n::canonicalize;

let doc = parse(br#"
%VERSION: 1.0
%STRUCT: User: [id, name, email]
---
users: @User
  | alice, Alice Smith, alice@example.com
  | bob, Bob Jones, bob@example.com
  | charlie, Charlie Brown, charlie@example.com
"#)?;

let canonical = canonicalize(&doc)?;
println!("{}", canonical);

Output:

%VERSION: 1.0
%STRUCT: User: [id, name, email]
---
users: @User[3]
  | alice, Alice Smith, alice@example.com
  | bob, Bob Jones, bob@example.com
  | charlie, Charlie Brown, charlie@example.com

Features Applied:

Count hint [3] added automatically
Consistent 2-space indentation
Minimal quoting (only when required)
Preserved key order

Custom Configuration

use hedl_c14n::{canonicalize_with_config, CanonicalConfig, QuotingStrategy};

let config = CanonicalConfig::builder()
    .use_ditto(true)                          // Enable ditto optimization
    .sort_keys(true)                          // Alphabetically sort fields
    .inline_schemas(true)                     // Inline schemas in headers
    .quoting(QuotingStrategy::Minimal)        // Minimal quoting
    .build();

let canonical = canonicalize_with_config(&doc, &config)?;

Ditto Optimization

Replace repeated values with ^ to reduce token count:

Without Ditto (use_ditto=false)

orders: @Order[id, customer, status, priority]
  | ord1, @User:alice, pending, high
  | ord2, @User:alice, pending, high
  | ord3, @User:bob, shipped, normal
  | ord4, @User:bob, shipped, normal
  | ord5, @User:bob, shipped, normal

With Ditto (use_ditto=true)

orders: @Order[id, customer, status, priority]
  | ord1, @User:alice, pending, high
  | ord2, ^, ^, ^
  | ord3, @User:bob, shipped, normal
  | ord4, ^, ^, ^
  | ord5, ^, ^, ^

Token Reduction: 33 fewer tokens (15% reduction for this example)

Ditto Logic

The ditto operator ^ is applied to a cell only when all of the following hold:

Exact Type Match: Same Value variant (String/Int/Float/Bool/Null/Reference/Expression)
Sequential Repetition: Value matches immediately previous row's same column
Not First Row: Ditto never used in first row of matrix (no previous value)

Type Equality Examples:

// These match (same type + value)
Value::String("alice") == Value::String("alice")  // ✓ → ^
Value::Int(42) == Value::Int(42)                 // ✓ → ^
Value::Float(3.14) == Value::Float(3.14)         // ✓ → ^
Value::Reference(qualified("User", "alice")) == Value::Reference(qualified("User", "alice"))  // ✓ → ^

// These don't match (different types or values)
Value::String("42") != Value::Int(42)            // ✗ → keep literal
Value::Float(0.0) != Value::Int(0)               // ✗ → keep literal
Value::Reference(local("alice")) != Value::Reference(qualified("User", "alice"))  // ✗ → keep literal
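The three rules above can be sketched in plain Rust. This is a minimal illustration, not the crate's implementation: the Cell enum and encode_with_ditto function are hypothetical stand-ins for hedl-core's Value type and the real serializer, and quoting is ignored.

```rust
/// Simplified stand-in for a HEDL cell value (hypothetical, not the hedl-core type).
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Str(String),
    Int(i64),
    Float(f64),
    Bool(bool),
    Null,
}

/// Render one cell as text (no quoting logic in this sketch).
fn render(cell: &Cell) -> String {
    match cell {
        Cell::Str(s) => s.clone(),
        Cell::Int(i) => i.to_string(),
        Cell::Float(f) => format!("{:?}", f),
        Cell::Bool(b) => b.to_string(),
        Cell::Null => "~".to_string(),
    }
}

/// Encode rows, replacing a cell with `^` when it equals the cell directly
/// above it (same variant and same value). The first row is always literal.
fn encode_with_ditto(rows: &[Vec<Cell>]) -> Vec<String> {
    let mut out = Vec::with_capacity(rows.len());
    for (i, row) in rows.iter().enumerate() {
        let cells: Vec<String> = row
            .iter()
            .enumerate()
            .map(|(col, cell)| {
                // Derived PartialEq compares the variant first, then the payload.
                let repeats = i > 0 && rows[i - 1].get(col) == Some(cell);
                if repeats { "^".to_string() } else { render(cell) }
            })
            .collect();
        out.push(format!("| {}", cells.join(", ")));
    }
    out
}
```

Because the comparison uses the derived PartialEq, Cell::Str("1") never matches Cell::Int(1), which mirrors the exact type match rule.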

Count Hints

Automatically generate [count] annotations:

// Input: no count hint
users: @User
  | alice, Alice
  | bob, Bob

// Output: count hint added
users: @User[2]
  | alice, Alice
  | bob, Bob

Benefits:

Parsers can pre-allocate memory (avoids Vec reallocation)
Streaming parsers know total row count upfront
Validation can detect truncated lists
20-30% faster parsing for large lists

Algorithm: Recursive traversal counts nodes in each matrix list before serialization.
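A rough sketch of both sides of the count hint, assuming a flat list of rows rather than the recursive traversal the crate performs. MatrixList, serialize_with_count, read_count_hint, and read_rows are hypothetical names for illustration only; the reader shows how a consumer might pre-allocate and detect truncation.

```rust
/// Hypothetical matrix list: a key, a struct type, and its rows.
struct MatrixList {
    name: String,
    type_name: String,
    rows: Vec<Vec<String>>,
}

/// Serialize a matrix list, writing the row count into the header.
fn serialize_with_count(list: &MatrixList) -> String {
    let mut out = format!("{}: @{}[{}]\n", list.name, list.type_name, list.rows.len());
    for row in &list.rows {
        out.push_str("  | ");
        out.push_str(&row.join(", "));
        out.push('\n');
    }
    out
}

/// Read the trailing `[count]` hint from a header such as `users: @User[2]`.
fn read_count_hint(header: &str) -> Option<usize> {
    let open = header.rfind('[')?;
    let close = header.rfind(']')?;
    header.get(open + 1..close)?.trim().parse().ok()
}

/// Collect rows, pre-allocating from the hint and rejecting truncated input.
fn read_rows(text: &str) -> Result<Vec<String>, String> {
    let mut lines = text.lines();
    let header = lines.next().ok_or("empty input")?;
    let expected = read_count_hint(header).ok_or("missing count hint")?;
    let mut rows = Vec::with_capacity(expected);
    for line in lines {
        if let Some(row) = line.trim_start().strip_prefix("| ") {
            rows.push(row.to_string());
        }
    }
    if rows.len() != expected {
        return Err(format!("expected {} rows, found {}", expected, rows.len()));
    }
    Ok(rows)
}
```

A production reader would likely cap the pre-allocation so that a forged hint cannot request an arbitrarily large buffer.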

Value Normalization

All values normalized to canonical form:

Float Normalization

// No trailing zeros
3.1400 → 3.14
5.000 → 5.0

// Whole numbers as floats (preserve type)
42.0 → 42.0
100.0 → 100.0

// Negative zero normalized
-0.0 → 0.0

// Special values
NaN → null     // Not preserved (becomes null)
Infinity → null
-Infinity → null
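The float rules above could be implemented along these lines. This is a sketch under stated assumptions: normalize_float is a hypothetical helper, None stands for "emit null", and the formatting relies on Rust's Display for f64, which prints the shortest decimal that round-trips and never uses exponent notation.

```rust
/// Normalize a float to canonical text; `None` stands for "emit null".
fn normalize_float(x: f64) -> Option<String> {
    if !x.is_finite() {
        return None; // NaN and +/-Infinity are not preserved
    }
    if x == 0.0 {
        return Some("0.0".to_string()); // -0.0 == 0.0, so this covers negative zero
    }
    let mut text = format!("{}", x); // e.g. 3.14 -> "3.14", 5.0 -> "5"
    if !text.contains('.') {
        text.push_str(".0"); // keep whole numbers recognizable as floats
    }
    Some(text)
}
```

Trailing zeros never appear because the value is re-rendered from the parsed f64 rather than copied from the source text.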

Null Representation

# Canonical form uses tilde
field: ~

Boolean Lowercase

True → true
FALSE → false

Reference Format

# Qualified references
customer: @User:alice

# Local references
prev: @item1

Quoting Strategy

Two quoting modes control string serialization:

Minimal (Default)

Quote only when necessary:

# No quotes needed
name: Alice Smith
status: active

# Quotes required (contains special characters)
path: "C:\\Program Files"
note: "Hello, world"    # Contains comma
value: "true"           # Looks like boolean
id: "42"                # Looks like integer
ref: "@alice"           # Starts with @ (looks like reference)

Triggers for Quoting:

Contains structural characters: : [ ] { } , | @
Starts with - (looks like list marker)
Matches boolean literal: true, false
Matches null literal: null, ~
Looks like number: 123, -456, 3.14
Empty string or only whitespace
Contains quotes or backslashes (requires escaping)
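The trigger list translates fairly directly into a predicate. The sketch below applies the list literally; needs_quotes, looks_like_number, and quote_minimal are hypothetical names, and the escaping rule (backslash-escape quotes and backslashes) is an assumption rather than something stated by the crate.

```rust
/// True for plain decimal literals such as `123`, `-456`, or `3.14`.
fn looks_like_number(s: &str) -> bool {
    let body = s.strip_prefix('-').unwrap_or(s);
    let mut parts = body.splitn(2, '.');
    let int_part = parts.next().unwrap_or("");
    let frac_part = parts.next();
    let all_digits = |p: &str| !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit());
    all_digits(int_part) && frac_part.map_or(true, all_digits)
}

/// Apply the quoting triggers from the list above.
fn needs_quotes(s: &str) -> bool {
    const STRUCTURAL: &[char] = &[':', '[', ']', '{', '}', ',', '|', '@'];
    s.trim().is_empty()
        || s.starts_with('-')
        || s.contains(STRUCTURAL)
        || s.contains('"')
        || s.contains('\\')
        || matches!(s, "true" | "false" | "null" | "~")
        || looks_like_number(s)
}

/// Quote only when a trigger fires (escaping rule is an assumption).
fn quote_minimal(s: &str) -> String {
    if !needs_quotes(s) {
        return s.to_string();
    }
    let escaped = s.replace('\\', "\\\\").replace('"', "\\\"");
    format!("\"{}\"", escaped)
}
```

Checking the cheap structural conditions before the number scan keeps the common unquoted case fast.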

Always

Quote all strings unconditionally:

name: "Alice Smith"
status: "active"
age: 30              # Numbers never quoted
active: true         # Booleans never quoted

Use When: Maximum compatibility with naive parsers, explicit type marking

Key Ordering

Control field order with sort_keys:

Preserve Order (sort_keys=false)

config:
  name: MyApp
  version: 1.0
  author: Alice

Preserves: Original insertion order from source document

Alphabetic Sort (sort_keys=true, default)

config:
  author: Alice
  name: MyApp
  version: 1.0

Benefits:

Consistent field ordering across documents
Easier visual diffing
Deterministic regardless of original order
Better for git diffs

Note: Entity IDs always appear first in matrix rows regardless of sort_keys.
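The effect of sort_keys on object fields can be sketched as follows. render_fields is a hypothetical helper, and the sort shown is byte-wise (so uppercase keys order before lowercase ones), which is an assumption about what "alphabetic" means here.

```rust
/// Render `key: value` lines, optionally sorting by key first.
fn render_fields(fields: &[(&str, &str)], sort_keys: bool) -> String {
    let mut ordered: Vec<&(&str, &str)> = fields.iter().collect();
    if sort_keys {
        // Byte-wise comparison of keys; the sort is stable.
        ordered.sort_by(|a, b| a.0.cmp(b.0));
    }
    let mut out = String::new();
    for (key, value) in ordered {
        out.push_str(&format!("  {}: {}\n", key, value));
    }
    out
}
```

With sorting enabled, two documents that differ only in field order render to the same text.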

Schema Handling

Two modes for schema representation:

Separate %STRUCT (inline_schemas=false, default)

%VERSION: 1.0
%STRUCT: User: [id, name, email]
---
users: @User[2]
  | alice, Alice, alice@example.com
  | bob, Bob, bob@example.com

Advantages:

Schema defined once, reused multiple times
Type system remains explicit
Easier schema updates (single location)

Inline Schemas (inline_schemas=true)

%VERSION: 1.0
---
users: @User[id, name, email][2]
  | alice, Alice, alice@example.com
  | bob, Bob, bob@example.com

Advantages:

Self-contained matrix lists
No forward references
Easier to extract individual lists
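The difference between the two modes comes down to where the field list is written. A small sketch, with matrix_header and struct_declaration as hypothetical helpers that reproduce the header shapes shown in the examples above:

```rust
/// Build a matrix list header, with the schema either inline or omitted.
fn matrix_header(
    name: &str,
    type_name: &str,
    fields: &[&str],
    count: usize,
    inline: bool,
) -> String {
    if inline {
        format!("{}: @{}[{}][{}]", name, type_name, fields.join(", "), count)
    } else {
        format!("{}: @{}[{}]", name, type_name, count)
    }
}

/// Build the separate declaration used when schemas are not inlined.
fn struct_declaration(type_name: &str, fields: &[&str]) -> String {
    format!("%STRUCT: {}: [{}]", type_name, fields.join(", "))
}
```

In separate mode the declaration is emitted once in the document header, and every list of that type carries only the count.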

Configuration Reference

CanonicalConfig Builder

use hedl_c14n::{CanonicalConfig, QuotingStrategy};

let config = CanonicalConfig::builder()
    .use_ditto(true)                          // Enable ^ optimization (default: true)
    .sort_keys(true)                          // Alphabetic sorting (default: true)
    .inline_schemas(false)                    // Inline vs %STRUCT (default: false)
    .quoting(QuotingStrategy::Minimal)        // Quoting mode (default: Minimal)
    .build();

Configuration Options

use_ditto (default: true)

Replace repeated values with ^ operator
Reduces token count by 15-40% for repetitive data
Trade-off: Slightly less human-readable, much more LLM-efficient

sort_keys (default: true)

Alphabetically sort object fields
Deterministic ordering regardless of insertion order
Improves git diff readability

inline_schemas (default: false)

true: Inline schemas in matrix headers @Type[field1, field2]
false: Separate %STRUCT declarations in header
Trade-off: Self-contained vs reusable schemas

quoting (default: Minimal)

QuotingStrategy::Minimal - Quote only when necessary
QuotingStrategy::Always - Quote all strings
Minimal recommended for token efficiency

Security: Depth Limits

Protection against deeply nested structures:

const MAX_NESTING_DEPTH: usize = 1000;

// Attempting to canonicalize > 1000 levels deep:
// Error: HedlError::Syntax { line: ..., message: "Max depth exceeded: 1001 levels (max: 1000)" }

Prevents:

Stack overflow from malicious input
Infinite recursion bugs
Accidental runaway nesting

Implementation: Depth counter incremented on each recursive call, decremented on return.
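A depth guard of this kind can be sketched with a toy tree. The Node type, check_depth, and nested are hypothetical, and this version passes the depth as an argument instead of incrementing and decrementing a shared counter; the effect is the same, since recursion stops one level past the limit.

```rust
const MAX_NESTING_DEPTH: usize = 1000;

/// Toy document tree (hypothetical, not the hedl-core node type).
enum Node {
    Leaf,
    Branch(Vec<Node>),
}

/// Walk the tree and fail as soon as a node sits deeper than the limit.
fn check_depth(node: &Node, depth: usize) -> Result<(), String> {
    if depth > MAX_NESTING_DEPTH {
        return Err(format!(
            "Max depth exceeded: {} levels (max: {})",
            depth, MAX_NESTING_DEPTH
        ));
    }
    if let Node::Branch(children) = node {
        for child in children {
            check_depth(child, depth + 1)?;
        }
    }
    Ok(())
}

/// Build a chain of `levels` nested branches around a single leaf.
fn nested(levels: usize) -> Node {
    let mut node = Node::Leaf;
    for _ in 0..levels {
        node = Node::Branch(vec![node]);
    }
    node
}
```

Because the check runs before descending into children, the call stack never grows beyond MAX_NESTING_DEPTH + 1 frames regardless of input.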

Error Handling

Canonicalization uses HedlError from hedl-core:

use hedl_c14n::canonicalize;
use hedl_core::HedlError;

match canonicalize(&doc) {
    Ok(canonical) => println!("{}", canonical),
    Err(HedlError::Syntax { line, message }) => {
        eprintln!("Syntax error at line {}: {}", line, message);
    }
    Err(e) => {
        eprintln!("Error: {}", e);
    }
}

Error Conditions

Nesting too deep: Document exceeds 1000-level nesting limit
Write failures: Internal buffer errors (extremely rare)

Errors include line numbers and context for debugging.

Round-Trip Stability

Canonical form preserves semantic equivalence:

use hedl_core::parse;
use hedl_c14n::canonicalize;

let original = parse(hedl_bytes)?;
let canonical_str = canonicalize(&original)?;
let reparsed = parse(canonical_str.as_bytes())?;

// Semantic equivalence holds
assert_eq!(original.version, reparsed.version);
assert_eq!(original.structs, reparsed.structs);
assert_eq!(original.entities, reparsed.entities);

Guarantees:

All fields preserved
All entities preserved
References maintained
Type information intact
Semantic equality (not string equality)

Non-Preserved:

Whitespace and formatting
Comment positions (comments stripped)
Original key order (if sort_keys=true)
Exact float representation (normalized)

Use Cases

Version Control Normalization: Canonicalize HEDL files before git commit to eliminate spurious formatting diffs. Enable clean git history focused on semantic changes.

LLM Context Optimization: Reduce token count by 15-40% through ditto optimization. Fit more data in 8K/32K/100K context windows without losing information.

Content-Addressable Storage: Generate deterministic hashes for identical documents regardless of source formatting. Enable deduplication and cache hits.

Cryptographic Signatures: Sign canonical form to ensure signatures verify regardless of whitespace/formatting variations. Ideal for document integrity verification.

Database Exports: Normalize exported HEDL for consistent baselines in testing and CI/CD. Detect actual data changes, not formatting noise.

Configuration Management: Standardize config file formatting across teams and tools. Automated formatting on save, consistent style enforcement.
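As an illustration of the content-addressable storage case, the sketch below keys a store by a hash of the canonical text. Store and content_key are hypothetical names. It uses the standard library's DefaultHasher only to stay dependency-free; that hasher is not cryptographic and its algorithm is not guaranteed stable across Rust releases, so a real store or signature scheme would use a cryptographic digest such as SHA-256 instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Derive a key from canonical text (non-cryptographic, illustration only).
fn content_key(canonical_text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    canonical_text.hash(&mut hasher);
    hasher.finish()
}

/// Minimal content-addressed store: identical canonical text is stored once.
struct Store {
    blobs: HashMap<u64, String>,
}

impl Store {
    fn new() -> Self {
        Store { blobs: HashMap::new() }
    }

    /// Insert canonical text and return its key; duplicates are not re-stored.
    fn put(&mut self, canonical_text: &str) -> u64 {
        let key = content_key(canonical_text);
        self.blobs
            .entry(key)
            .or_insert_with(|| canonical_text.to_string());
        key
    }
}
```

Deduplication only works if every writer canonicalizes before hashing, which is the reason the canonical form has to be deterministic.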

What This Crate Doesn't Do

Validation: Canonicalization assumes input is valid. For validation, use hedl-lint before canonicalization.

Comment Preservation: Comments are not part of canonical form and are stripped. For comment-preserving formatting, use hedl-core's pretty-printer.

Custom Formatting Rules: Configuration is comprehensive but not infinite. Highly custom formatting requirements may need custom serialization.

Schema Inference: Uses existing schemas from document. For schema generation from data, use hedl-core's inference APIs.

Performance Characteristics

Time Complexity: O(n) where n = total nodes + fields. Single linear pass through document tree.

Space Complexity: O(n) output buffer + O(d) recursion stack where d = nesting depth. Pre-allocation optimization (P1) amortizes allocations.

Optimizations Implemented:

P0: Direct BTreeMap Iteration - Iterate without intermediate Vec allocation (eliminates O(n) allocation)
P1: Pre-allocated Buffers - Estimate output size, allocate once (reduces allocation count by 90%)

Ditto Performance: Type equality checks are O(1) constant time. Minimal overhead (~1-2% slower than no-ditto).

Count Hint Generation: O(n) single pass to count nodes. Cached during serialization (no double-traversal).
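The idea behind the P1 pre-allocation can be sketched as follows: compute a cheap upper bound on the output size, reserve it once, then write without further growth. estimate_len and render_rows are hypothetical helpers, and the estimate formula is an assumption chosen for this toy row format.

```rust
/// Upper bound on rendered size: 5 bytes of row framing ("  | " plus the
/// newline) and at most 2 bytes of separator per cell.
fn estimate_len(rows: &[Vec<&str>]) -> usize {
    rows.iter()
        .map(|row| 5 + row.iter().map(|cell| cell.len() + 2).sum::<usize>())
        .sum()
}

/// Render rows into a buffer that is reserved once up front.
fn render_rows(rows: &[Vec<&str>]) -> String {
    let mut out = String::with_capacity(estimate_len(rows));
    for row in rows {
        out.push_str("  | ");
        for (i, cell) in row.iter().enumerate() {
            if i > 0 {
                out.push_str(", ");
            }
            out.push_str(cell);
        }
        out.push('\n');
    }
    out
}
```

Overestimating slightly is harmless here; underestimating would only cost a reallocation, not correctness.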

Dependencies

hedl-core 1.0 - Core HEDL data structures and parsing
thiserror 1.0 - Error type definitions

License

Apache-2.0

