| Crates.io | hedl-stream |
| lib.rs | hedl-stream |
| version | 1.2.0 |
| created_at | 2026-01-08 11:02:03.147755+00 |
| updated_at | 2026-01-21 03:01:01.425322+00 |
| description | Streaming parser for HEDL - memory-efficient processing of large files |
| homepage | https://dweve.com |
| repository | https://github.com/dweve/hedl |
| max_upload_size | |
| id | 2029988 |
| size | 544,903 |
Memory-efficient streaming parser for HEDL documents—process multi-gigabyte files with constant memory usage.
Large HEDL files don't fit in RAM. Database exports, log archives, data pipelines—gigabytes of structured data that need processing without loading everything into memory. Traditional parsing loads the entire document first, then gives you access. That doesn't scale.
hedl-stream provides event-driven streaming parsing with O(1) memory regardless of file size. Process 10 GB files with 100 MB RAM. Iterate through nodes as they're parsed. Build custom processing pipelines with standard Rust iterators. Add timeout protection for untrusted input. Optional async support for high-concurrency scenarios.
Production-grade streaming with comprehensive features. Add it to your Cargo.toml:
[dependencies]
hedl-stream = "1.2"
# For async support:
hedl-stream = { version = "1.2", features = ["async"] }
tokio = { version = "1", features = ["io-util"] }
Process large HEDL files with constant memory:
use hedl_stream::{StreamingParser, NodeEvent};
use std::fs::File;
// Open large HEDL file (e.g., 5 GB database export)
let file = File::open("massive_data.hedl")?;
let parser = StreamingParser::new(file)?;
let mut node_count = 0;
for event in parser {
    match event? {
        NodeEvent::Header(header) => {
            println!("Version: {}.{}", header.version.0, header.version.1);
            println!("Schemas: {:?}", header.structs.keys());
        }
        NodeEvent::ListStart { key, type_name, schema, .. } => {
            println!("Processing list '{}' of type {}", key, type_name);
        }
        NodeEvent::Node(node) => {
            node_count += 1;
            // Process individual node.
            // Memory usage stays constant regardless of total nodes.
        }
        NodeEvent::ListEnd { key, count, .. } => {
            println!("Completed list '{}': {} nodes", key, count);
        }
        NodeEvent::EndOfDocument => break,
        _ => {}
    }
}
println!("Processed {} nodes from multi-GB file", node_count);
Memory Usage: Only the current line and context stack are in memory. A 5 GB file uses the same memory as a 5 MB file.
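Because the parser is iterable (as the for loop above relies on), simple pipelines can also be written with standard iterator adapters. A minimal sketch, reusing the file and the `User` type from the earlier example; parse errors are dropped here for brevity, and a real pipeline should propagate them:

use hedl_stream::{StreamingParser, NodeEvent};
use std::fs::File;

let file = File::open("massive_data.hedl")?;

// Count User nodes with plain iterator adapters; memory stays constant
// because events are consumed one at a time.
let user_count = StreamingParser::new(file)?
    .into_iter()
    .filter_map(|event| event.ok())
    .filter(|event| matches!(event, NodeEvent::Node(node) if node.type_name == "User"))
    .count();

println!("User nodes: {}", user_count);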
Fine-tune parsing with StreamingParserConfig:
use hedl_stream::{StreamingParser, StreamingParserConfig, MemoryLimits};
use std::time::Duration;
use std::fs::File;
let config = StreamingParserConfig {
    max_line_length: 500_000,                // 500 KB max line (default: 1 MB)
    max_indent_depth: 50,                    // 50 levels max nesting (default: 100)
    buffer_size: 128 * 1024,                 // 128 KB I/O buffer (default: 64 KB)
    timeout: Some(Duration::from_secs(30)),  // 30-second timeout (default: None)
    memory_limits: MemoryLimits::default(),  // Default memory limits
    enable_pooling: false,                   // Disable buffer pooling (default: false)
};
let file = File::open("untrusted_input.hedl")?;
let parser = StreamingParser::with_config(file, config)?;
// Parsing will error if it exceeds 30 seconds (DoS protection)
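The option list below documents `MemoryLimits::untrusted()` for stricter limits on untrusted input. As a rough sketch, it can be combined with a timeout and tighter structural limits; the specific numbers and the file name here are illustrative, not recommendations:

use hedl_stream::{StreamingParser, StreamingParserConfig, MemoryLimits};
use std::fs::File;
use std::time::Duration;

// Illustrative limits for user-uploaded files; tune to your workload.
let config = StreamingParserConfig {
    max_line_length: 100_000,                  // Reject pathological single lines early
    max_indent_depth: 20,                      // Untrusted documents rarely need deep nesting
    buffer_size: 64 * 1024,                    // Default-sized I/O buffer
    timeout: Some(Duration::from_secs(10)),    // Bound total parse time
    memory_limits: MemoryLimits::untrusted(),  // Stricter limits for untrusted input
    enable_pooling: false,                     // Pooling off; see enable_pooling note below
};

let file = File::open("user_upload.hedl")?;
let parser = StreamingParser::with_config(file, config)?;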
Configuration options:
- `max_line_length` (default: 1,000,000 bytes) - Syntax error if exceeded
- `max_indent_depth` (default: 100 levels) - Syntax error if exceeded
- `buffer_size` (default: 65,536 bytes)
- `timeout` (default: None) - Timeout error if exceeded
- `memory_limits` (default: `MemoryLimits::default()`) - Use `MemoryLimits::default()` for normal operation and `MemoryLimits::untrusted()` for stricter limits on untrusted input; see the `MemoryLimits` documentation for detailed configuration
- `enable_pooling` (default: false) - Requires `memory_limits.enable_buffer_pooling` to also be true

The parser emits the following events:
- `Header(HeaderInfo)`
- `ListStart { key, type_name, schema, line }` - `key`: field name (e.g., "users"); `type_name`: entity type (e.g., "User"); `schema`: column names for the matrix; `line`: source line number
- `Node(NodeInfo)`
- `ListEnd { key, type_name, count }` - `count`: total nodes in the list
- `Scalar { key, value, line }` - e.g., `name: "My App"`
- `ObjectStart { key, line }` / `ObjectEnd { key }`
- `EndOfDocument`
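The first example only handled list-oriented events; here is a minimal sketch that also consumes `Scalar` and `ObjectStart`/`ObjectEnd` events. The indented printing and the file name are illustrative, and it assumes `Value` implements `Debug`:

use hedl_stream::{StreamingParser, NodeEvent};
use std::fs::File;

let file = File::open("config_like.hedl")?;
let mut depth = 0;

for event in StreamingParser::new(file)? {
    match event? {
        NodeEvent::ObjectStart { key, .. } => {
            println!("{}{}:", "  ".repeat(depth), key);
            depth += 1;
        }
        NodeEvent::ObjectEnd { .. } => depth -= 1,
        NodeEvent::Scalar { key, value, .. } => {
            println!("{}{} = {:?}", "  ".repeat(depth), key, value);
        }
        NodeEvent::EndOfDocument => break,
        _ => {}
    }
}

`Node` events carry a `NodeInfo` payload: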
pub struct NodeInfo {
    pub type_name: String,            // Entity type (e.g., "User")
    pub id: String,                   // Entity ID (first field)
    pub fields: Vec<Value>,           // All field values, including the ID
    pub depth: usize,                 // Nesting depth (0 = root)
    pub parent_id: Option<String>,    // Parent entity ID if nested
    pub parent_type: Option<String>,  // Parent entity type if nested
    pub line: usize,                  // Source line number
    pub child_count: Option<usize>,   // Expected child count from |[N] syntax
}
Methods:
- `get_field(index) -> Option<&Value>` - Get field by column index
- `is_nested() -> bool` - Check whether the node has a parent

pub struct HeaderInfo {
    pub version: (u32, u32),                     // Major.Minor
    pub structs: BTreeMap<String, Vec<String>>,  // Type schemas
    pub aliases: BTreeMap<String, String>,       // Variable aliases
    pub nests: BTreeMap<String, String>,         // Parent->Child rules
}
Methods:
- `get_schema(type_name) -> Option<&Vec<String>>` - Look up a type schema
- `get_child_type(parent_type) -> Option<&String>` - Get the child type for a parent

For high-concurrency scenarios with thousands of concurrent streams:
use hedl_stream::{AsyncStreamingParser, NodeEvent};
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large_data.hedl").await?;
    let mut parser = AsyncStreamingParser::new(file).await?;
    let mut count = 0;
    loop {
        match parser.next_event().await? {
            Some(NodeEvent::Node(_)) => count += 1,
            Some(NodeEvent::EndOfDocument) => break,
            Some(_) => {}
            None => break,
        }
    }
    println!("Processed {} nodes", count);
    Ok(())
}
Performance: Async version has identical memory profile to sync version. Suitable for processing thousands of files concurrently without blocking.
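As a rough illustration of concurrent use, two files can be streamed at once with `tokio::join!`; the file names and the `count_nodes` helper are hypothetical, not part of the crate's API:

use hedl_stream::{AsyncStreamingParser, NodeEvent};
use tokio::fs::File;

// Hypothetical helper: count Node events in a single file.
async fn count_nodes(path: &str) -> Result<usize, Box<dyn std::error::Error>> {
    let file = File::open(path).await?;
    let mut parser = AsyncStreamingParser::new(file).await?;
    let mut count = 0;
    loop {
        match parser.next_event().await? {
            Some(NodeEvent::Node(_)) => count += 1,
            Some(NodeEvent::EndOfDocument) | None => break,
            Some(_) => {}
        }
    }
    Ok(count)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Both files are parsed concurrently on the same task; each stream
    // keeps only its own constant-size parser state in memory.
    let (a, b) = tokio::join!(
        count_nodes("export_a.hedl"),
        count_nodes("export_b.hedl"),
    );
    println!("export_a: {} nodes, export_b: {} nodes", a?, b?);
    Ok(())
}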
CSV-like comma-separated rows prefixed with |:
users: @User[id, name, age, active]
| alice, Alice Smith, 30, true
| bob, Bob Jones, 25, false
| carol, Carol White, 35, true
Features: quoted values may contain commas and escaped quotes:
| id1, "Smith, John", 30
| id2, "Quote: \"value\"", 25

Repeat the previous value with ^:
orders: @Order[id, customer, status]
| ord1, @User:alice, pending
| ord2, ^, shipped # customer = @User:alice (from previous row)
| ord3, @User:bob, pending
| ord4, ^, ^ # customer = @User:bob, status = pending
Automatic detection of entity references:
# Qualified reference
customer: @User:alice # Reference(qualified("User", "alice"))
# Local reference
parent: @previous_item # Reference(local("previous_item"))
Variable substitution with $:
%ALIAS: api_url: https://api.example.com
---
config:
  endpoint: $api_url # Substituted to "https://api.example.com"
Full-line and inline comments:
# This is a full-line comment
users: @User[id, name]
| alice, Alice # This is an inline comment
| bob, "Bob # Not a comment (inside quotes)"
SIMD Optimization: When compiled with AVX2 support (x86_64), comment detection uses 32-byte SIMD scanning for 2-3x speedup on comment-heavy files.
Comprehensive error types with line numbers:
use hedl_stream::{StreamingParser, StreamError};
use std::fs::File;

let parser = StreamingParser::new(File::open("data.hedl")?)?;

for event in parser {
    match event {
        Ok(event) => { /* process */ }
        Err(StreamError::Syntax { line, message }) => {
            eprintln!("Syntax error at line {}: {}", line, message);
        }
        Err(StreamError::ShapeMismatch { line, expected, got }) => {
            eprintln!("Line {}: expected {} columns, got {}", line, expected, got);
        }
        Err(StreamError::OrphanRow { line, message }) => {
            eprintln!("Line {}: orphan row - {}", line, message);
        }
        Err(StreamError::Timeout { elapsed, limit }) => {
            eprintln!("Parsing timeout: {:?} exceeded limit {:?}", elapsed, limit);
        }
        Err(e) => {
            eprintln!("Other error: {}", e);
        }
    }
}
StreamError variants:
- `Io(std::io::Error)` - I/O read failures
- `Utf8 { line, message }` - Invalid UTF-8 encoding
- `Syntax { line, message }` - Parse syntax error
- `Schema { line, message }` - Schema/type mismatch
- `Header(String)` - Invalid header format
- `MissingVersion` - No %VERSION directive
- `InvalidVersion(String)` - Malformed version string
- `OrphanRow { line, message }` - Child row without a parent entity
- `ShapeMismatch { line, expected, got }` - Column count doesn't match the schema
- `Timeout { elapsed, limit }` - Parsing exceeded the timeout duration
- `LineTooLong { line, length, limit }` - Line exceeds the max_line_length configuration
- `InvalidUtf8 { line, error }` - Invalid UTF-8 with detailed error information

Database Export Processing: Stream multi-GB database exports, transform data row by row, and write to a different format without loading the entire export into memory.
Log File Analysis: Parse massive HEDL log archives, filter events, aggregate statistics, generate reports—all with constant memory usage.
Data Pipeline Integration: Read HEDL from network streams, process incrementally, forward to downstream systems without buffering.
ETL Workflows: Extract from large HEDL files, transform with custom logic, load to database with batch inserts—process millions of rows efficiently.
Real-Time Processing: Parse HEDL data as it arrives (stdin, network socket), emit events immediately, support backpressure naturally.
Untrusted Input Validation: Parse user-uploaded HEDL with timeout protection, validate structure, reject malicious input before full processing.
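A minimal sketch of the stream-transform-forward pattern from the use cases above; the file names, the `Order` type, and writing bare IDs to a text file are illustrative assumptions, not a prescribed workflow:

use hedl_stream::{StreamingParser, NodeEvent};
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input = File::open("orders_export.hedl")?;
    let mut output = BufWriter::new(File::create("order_ids.txt")?);

    // Stream the export and forward only what is needed; memory stays
    // bounded by the parser state plus the output buffer.
    for event in StreamingParser::new(input)? {
        if let NodeEvent::Node(node) = event? {
            if node.type_name == "Order" {
                writeln!(output, "{}", node.id)?;
            }
        }
    }
    output.flush()?;
    Ok(())
}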
Full Document Construction: Doesn't build complete Document objects—that's hedl-core's job. For full document parsing, use hedl_core::parse(). Use streaming when you need memory efficiency.
Random Access: Sequential-only parser. Can't jump to arbitrary positions. For random access, load full document with hedl-core.
Modification: Read-only parser. Can't modify nodes during parsing. For transformations, consume events and write new HEDL output.
Validation: Parses structure, doesn't validate business rules. For schema validation, use hedl-lint on parsed documents.
Memory: O(nesting_depth) regardless of file size. Typically <1 MB for files of any size with reasonable nesting.
I/O: Configurable buffer size (default 64 KB) minimizes syscalls. Batched reads for optimal throughput.
Parsing: Linear pass through input. SIMD-accelerated comment detection (AVX2, ~2-3x faster for comment-heavy files).
Timeout Checks: Every 100 operations (~0.1% overhead). Negligible impact on normal workloads.
Async: Same memory profile as sync. Non-blocking I/O yields to runtime during reads. Suitable for thousands of concurrent streams.
Dependencies:
- `hedl-core` (workspace) - Core types (Value, Reference), lexer utilities
- `thiserror` 1.0 - Error type definitions
- `tokio` 1.35 (optional, "async" feature) - Async I/O runtime

License: Apache-2.0