| Crates.io | hedl-parquet |
| lib.rs | hedl-parquet |
| version | 1.2.0 |
| created_at | 2026-01-09 00:20:17.494705+00 |
| updated_at | 2026-01-21 03:01:54.749187+00 |
| description | HEDL to/from Apache Parquet conversion |
| homepage | https://dweve.com |
| repository | https://github.com/dweve/hedl |
| max_upload_size | |
| id | 2031329 |
| size | 624,534 |
Bidirectional HEDL ↔ Apache Parquet conversion—columnar storage for analytics workloads with compression and efficient querying.
Analytics systems need columnar storage. Parquet is the standard for big data workflows—Spark, Presto, Athena, BigQuery all consume it. But converting HEDL documents to Parquet shouldn't lose type information or structural semantics. Reading Parquet files back to HEDL for transformation and validation shouldn't require custom schemas.
hedl-parquet provides bidirectional conversion between HEDL and Apache Parquet via Arrow 57.0 integration. Export HEDL entity lists as columnar Parquet tables with automatic schema generation and type mapping. Choose compression algorithms (SNAPPY/GZIP/ZSTD/UNCOMPRESSED) for storage optimization. Import Parquet files back to HEDL with full schema preservation and metadata roundtrip. Security-hardened with decompression bomb protection and column count limits.
Comprehensive Parquet integration with Apache Arrow, with HEDL references preserved in @Type:id format. Add the crate to Cargo.toml:
[dependencies]
hedl-parquet = "1.2"
Export HEDL entity list to Parquet file:
use hedl_core::parse;
use hedl_parquet::to_parquet;
use std::path::Path;
let doc = parse(br#"
%VERSION: 1.0
%STRUCT: User: [id, name, age, email, active]
---
users: @User
| alice, Alice Smith, 30, alice@example.com, true
| bob, Bob Jones, 25, bob@example.com, true
| carol, Carol White, 35, carol@example.com, false
"#)?;
to_parquet(&doc, Path::new("users.parquet"))?;
Generated Parquet:
id: Utf8, name: Utf8, age: Int64, email: Utf8, active: Boolean
Configure compression with to_parquet_with_config:
use hedl_parquet::{to_parquet_with_config, ToParquetConfig};
use parquet::basic::Compression;
use std::path::Path;
let config = ToParquetConfig {
compression: Compression::ZSTD(Default::default()),
..Default::default()
};
to_parquet_with_config(&doc, Path::new("users.parquet"), &config)?;
Import Parquet file back to HEDL:
use hedl_parquet::from_parquet;
use std::path::Path;
let doc = from_parquet(Path::new("users.parquet"))?;
// Use HEDL's structured API
println!("Version: {}.{}", doc.version.0, doc.version.1);
for (key, item) in &doc.root {
println!("{}: {:?}", key, item);
}
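Combining the two calls, a minimal round-trip sketch (trimmed input; the file path is illustrative and only the version header is compared):
use hedl_core::parse;
use hedl_parquet::{from_parquet, to_parquet};
use std::path::Path;
// Export, re-import, and check that the version header survives the round trip.
let original = parse(b"%VERSION: 1.0\n%STRUCT: User: [id, name]\n---\nusers: @User\n| alice, Alice\n")?;
to_parquet(&original, Path::new("roundtrip.parquet"))?;
let restored = from_parquet(Path::new("roundtrip.parquet"))?;
assert_eq!(original.version.0, restored.version.0);
assert_eq!(original.version.1, restored.version.1);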
Fast compression with moderate compression ratios:
Compression::SNAPPY
Characteristics: Very fast compression and decompression, moderate file-size reduction, low CPU overhead
When to Use: General-purpose analytics, interactive queries, when read speed matters more than storage
High compression ratios at cost of speed:
Compression::GZIP(Default::default())
Characteristics: Higher compression ratios than SNAPPY at the cost of slower compression and higher CPU usage
When to Use: Long-term storage, cost-sensitive scenarios, infrequent reads
Modern compression with excellent speed/ratio balance:
Compression::ZSTD(Default::default())
Characteristics: Compression ratios comparable to or better than GZIP at speeds much closer to SNAPPY; levels are tunable
When to Use: Best overall choice for new systems, Spark/Presto workloads
No compression (raw columnar storage):
Compression::UNCOMPRESSED
Characteristics: No compression CPU overhead, largest file sizes; dictionary encoding still applies when enabled
When to Use: Data already compressed at filesystem level, in-memory processing
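The parquet crate also exposes explicit compression levels. A sketch using GZIP level 9 with the same config pattern as the ZSTD example above (GzipLevel comes from the parquet crate; doc is the document parsed in the export example):
use hedl_parquet::{to_parquet_with_config, ToParquetConfig};
use parquet::basic::{Compression, GzipLevel};
use std::path::Path;
// Maximum GZIP compression for archival exports.
let config = ToParquetConfig {
    compression: Compression::GZIP(GzipLevel::try_new(9).expect("valid gzip level")),
    ..Default::default()
};
to_parquet_with_config(&doc, Path::new("users_archive.parquet"), &config)?;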
HEDL types map to Arrow types with full fidelity:
// HEDL Value → Arrow Type
Value::Int(42) → arrow::datatypes::DataType::Int64
Value::Float(3.14) → arrow::datatypes::DataType::Float64
Value::String("alice") → arrow::datatypes::DataType::Utf8
Value::Bool(true) → arrow::datatypes::DataType::Boolean
Value::Null → null value in nullable column
// References serialized as strings
Value::Reference(hedl_core::Reference::qualified("User", "alice")) → "@User:alice" (Utf8)
Value::Reference(hedl_core::Reference::local("item1")) → "@item1" (Utf8)
// Expressions evaluated then converted
Value::Expression("$(1+2)") → 3 (Int64)
Nullable Columns: All columns nullable by default (Arrow schema)
Type Inference: When importing Parquet without HEDL metadata, types are inferred from the Arrow schema
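To verify the mapping on disk, the Arrow schema of a written file can be inspected with the parquet crate's Arrow reader. A sketch against the users.parquet file from the export example:
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use std::fs::File;
// Open the file written by to_parquet and print the Arrow schema it carries.
let file = File::open("users.parquet")?;
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
for field in builder.schema().fields() {
    // Expected: id/name/email as Utf8, age as Int64, active as Boolean.
    println!("{}: {:?}", field.name(), field.data_type());
}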
Automatic Arrow schema from HEDL %STRUCT:
%STRUCT: Product: [id, name, price, stock, discontinued]
Generated Arrow Schema:
Schema {
fields: [
Field { name: "id", data_type: Utf8, nullable: true },
Field { name: "name", data_type: Utf8, nullable: true },
Field { name: "price", data_type: Float64, nullable: true },
Field { name: "stock", data_type: Int64, nullable: true },
Field { name: "discontinued", data_type: Boolean, nullable: true },
],
metadata: {
"hedl.version": "1.0",
"hedl.struct.Product": "[id, name, price, stock, discontinued]"
}
}
Metadata Included: hedl.version and per-struct hedl.struct.* entries, as shown in the schema above
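These entries end up in the Parquet file's key-value metadata and can be read back with the parquet crate directly. A sketch (exact key names as shown in the schema above):
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;
// Dump the key-value metadata written alongside the schema.
let reader = SerializedFileReader::new(File::open("users.parquet")?)?;
if let Some(kv) = reader.metadata().file_metadata().key_value_metadata() {
    for entry in kv {
        // Expect keys such as "hedl.version" and "hedl.struct.User".
        println!("{} = {:?}", entry.key, entry.value);
    }
}
Writer behavior is configured through ToParquetConfig: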
use hedl_parquet::ToParquetConfig;
use parquet::basic::Compression;
let config = ToParquetConfig {
compression: Compression::SNAPPY, // Compression algorithm
enable_dictionary: true, // Dictionary encoding (default: true)
..Default::default()
};
compression (default: SNAPPY)
enable_dictionary (default: true)
writer_version (default: WriterVersion::PARQUET_2_0)
statistics (default: EnabledStatistics::Chunk)
coerce_types (default: false)
Controls how type mismatches are handled
false (recommended): Type mismatches write null
true: Type mismatches coerce to default values (0, false, "")
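Putting the documented options together in one config. A sketch: the field names follow the list above, and the parquet types for writer_version and statistics are assumed from the stated defaults:
use hedl_parquet::ToParquetConfig;
use parquet::basic::Compression;
use parquet::file::properties::{EnabledStatistics, WriterVersion};
// Spell out the documented defaults explicitly; the struct update keeps any
// remaining fields at their default values.
let config = ToParquetConfig {
    compression: Compression::SNAPPY,
    enable_dictionary: true,
    writer_version: WriterVersion::PARQUET_2_0,
    statistics: EnabledStatistics::Chunk,
    coerce_types: false,
    ..Default::default()
};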
Prevents memory exhaustion from malicious compressed Parquet:
// Internal constant: MAX_DECOMPRESSED_SIZE = 100 MB
// Reading compressed file that decompresses to > 100 MB:
// Returns HedlError::security("Decompressed size exceeds limit...")
Protection Against: Decompression bombs and memory exhaustion from maliciously compressed files
Prevents resource exhaustion from wide schemas:
// Internal constant: MAX_COLUMNS = 1000
// Processing schema with > 1000 columns:
// Returns HedlError::security("Schema exceeds maximum column count: 1500 (max: 1000)")
Protection Against: Resource exhaustion from excessively wide schemas
All integer operations checked for overflow:
// Safe arithmetic on column counts, row counts, buffer sizes
// Panics prevented, returns HedlError on overflow
Sequential processing guarantees insertion order:
users: @User
| alice, Alice # Row 0
| bob, Bob # Row 1
| carol, Carol # Row 2
Parquet Order: Row 0, 1, 2 (same as HEDL)
When Importing: Rows reconstructed in same order
Note: Row order is always preserved. The Parquet writer processes rows sequentially.
Process multiple documents to separate Parquet files:
use hedl_parquet::to_parquet;
use hedl_core::parse;
use std::path::Path;
// Write each document to a separate file; `hedl_files` is assumed to be a list of input paths (e.g., Vec<PathBuf>)
for file_path in hedl_files {
let content = std::fs::read(&file_path)?;
let doc = parse(&content)?;
let output_path = file_path.with_extension("parquet");
to_parquet(&doc, &output_path)?;
}
Benefits: Each document converts independently, so failures are isolated, memory stays bounded per document, and files can be processed in parallel
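Because each file converts independently, the loop above parallelizes cleanly. A sketch using rayon (an external crate, not a hedl-parquet dependency; the data directory and DynError alias are illustrative):
use hedl_core::parse;
use hedl_parquet::to_parquet;
use rayon::prelude::*;
use std::path::PathBuf;
type DynError = Box<dyn std::error::Error + Send + Sync>;
// Collect *.hedl files from an assumed input directory.
let hedl_files: Vec<PathBuf> = std::fs::read_dir("data")?
    .filter_map(|entry| entry.ok().map(|e| e.path()))
    .filter(|p| p.extension().map_or(false, |ext| ext == "hedl"))
    .collect();
// Convert each file on a worker thread. HedlError is assumed to implement
// std::error::Error (it is defined with thiserror), so `?` boxes it.
hedl_files.par_iter().try_for_each(|path| -> Result<(), DynError> {
    let doc = parse(&std::fs::read(path)?)?;
    to_parquet(&doc, &path.with_extension("parquet"))?;
    Ok(())
})?;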
All functions return Result<T, HedlError> from hedl-core:
use hedl_parquet::to_parquet;
use hedl_core::HedlError;
use std::path::Path;
match to_parquet(&doc, Path::new("output.parquet")) {
Ok(()) => println!("Export successful"),
Err(e) if matches!(e.kind, hedl_core::HedlErrorKind::Security) => {
eprintln!("Security error: {}", e);
}
Err(e) if matches!(e.kind, hedl_core::HedlErrorKind::IO) => {
eprintln!("I/O error: {}", e);
}
Err(e) => eprintln!("Error: {}", e),
}
Data Warehousing: Export HEDL entity lists to Parquet for loading into Snowflake, BigQuery, Redshift. Leverage columnar compression and query performance.
Spark Pipelines: Convert HEDL to Parquet for Apache Spark processing. Read from Spark, transform, write back to HEDL for downstream systems.
Athena Queries: Store HEDL exports as Parquet in S3, query with AWS Athena. Partition by date/type for cost-effective analytics.
ML Feature Stores: Export training data from HEDL to Parquet for ML pipelines. Read from Parquet, transform features, train models.
Long-Term Archival: Store HEDL data as compressed Parquet for archival. GZIP/ZSTD compression reduces storage costs significantly.
Data Lake Integration: Append HEDL data to Parquet-based data lakes. Maintain compatibility with ecosystem tools (Hive, Presto, Drill).
Multi-File Datasets: Writes a single Parquet file per entity list. For partitioned datasets, use external partitioning logic (see the sketch after this list).
Nested Structures: Flattens nested HEDL entities to flat Parquet tables. For complex nesting, use JSON export or custom schema mapping.
Streaming Writes: Buffers the entire entity list in memory before writing. For streaming, write in batches to multiple files.
Predicate Pushdown: No query optimization—reads entire file. For selective queries, use query engines (Spark, Presto) on generated Parquet.
Schema Evolution: No automatic schema migration. For evolving schemas, handle versioning externally.
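For the Multi-File Datasets limitation above, partitioning stays outside the crate. A sketch writing one file per date into Hive-style directories (docs_by_date and the warehouse/users layout are assumptions, not crate features):
use hedl_parquet::to_parquet;
use std::fs;
use std::path::PathBuf;
// `docs_by_date` is assumed: a list of (date string, parsed HEDL document) pairs.
for (date, doc) in &docs_by_date {
    // Hive-style layout: warehouse/users/date=2026-01-09/part-0.parquet
    let dir = PathBuf::from(format!("warehouse/users/date={date}"));
    fs::create_dir_all(&dir)?;
    to_parquet(doc, &dir.join("part-0.parquet"))?;
}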
Write Performance: 50-200 MB/s depending on compression (SNAPPY fastest, GZIP slowest)
Read Performance: 100-500 MB/s columnar reads (benefits from predicate/projection pushdown in query engines)
Compression Ratios: ZSTD and GZIP generally produce smaller files than SNAPPY; actual ratios depend on the data
Memory Usage: O(row_group_size * column_count) during write, O(row_group_size) during read
hedl-core 1.2 - Core HEDL implementation
arrow 57.0 - Apache Arrow columnar format
parquet 57.0 - Apache Parquet file format
thiserror 1.0 - Error type definitions
Apache-2.0