| Crates.io | parquet-record |
| lib.rs | parquet-record |
| version | 0.2.0 |
| created_at | 2025-12-09 01:34:23.184033+00 |
| updated_at | 2025-12-09 23:17:32.092337+00 |
| description | High-performance Rust library for moving structs to/from disk using Parquet format. Abstracts complex Arrow/Parquet usage while providing batch writing and parallel reading capabilities for maximum performance. |
| homepage | |
| repository | |
| max_upload_size | |
| id | 1974673 |
| size | 102,314 |
A high-performance Rust library for moving structs to/from disk using Parquet format. Abstracts complex Arrow/Parquet usage while providing batch writing and parallel reading capabilities for maximum performance.
Add this to your Cargo.toml:
[dependencies]
parquet-record = "0.1.0" # Replace with the actual version
use parquet_record::{ParquetRecord, ParquetBatchWriter};
use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{Schema, Field, DataType};
use arrow::record_batch::RecordBatch;
use std::sync::Arc;
use std::io;
// Define a struct that implements ParquetRecord
#[derive(Debug, Clone)]
struct MyRecord {
id: i32,
name: String,
}
impl ParquetRecord for MyRecord {
fn schema() -> Arc<Schema> {
Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("name", DataType::Utf8, false),
]))
}
fn items_to_records(schema: Arc<Schema>, items: &[Self]) -> RecordBatch {
let ids: Vec<i32> = items.iter().map(|item| item.id).collect();
let names: Vec<&str> = items.iter().map(|item| item.name.as_str()).collect();
RecordBatch::try_new(
schema,
vec![
Arc::new(Int32Array::from(ids)),
Arc::new(StringArray::from(names)),
],
).unwrap()
}
fn records_to_items(record_batch: &RecordBatch) -> io::Result<Vec<Self>> {
use arrow::array::{as_primitive_array, StringArray};
let id_array = as_primitive_array::<arrow::datatypes::Int32Type>(record_batch.column(0));
let name_array = record_batch
.column(1)
.as_any()
.downcast_ref::<StringArray>()
.unwrap();
let mut result = Vec::new();
for i in 0..record_batch.num_rows() {
result.push(MyRecord {
id: id_array.value(i),
name: name_array.value(i).to_string(),
});
}
Ok(result)
}
}
// Writing to a Parquet file
let writer = ParquetBatchWriter::<MyRecord>::new("output.parquet".to_string(), Some(100));
let records = vec![
MyRecord { id: 1, name: "Alice".to_string() },
MyRecord { id: 2, name: "Bob".to_string() },
MyRecord { id: 3, name: "Charlie".to_string() },
];
writer.add_items(records).unwrap();
writer.close().unwrap();
// Reading from a Parquet file
if let Some((total_rows, reader)) = parquet_record::read_parquet(
MyRecord::schema(),
"output.parquet",
Some(50)
) {
let all_records: Vec<MyRecord> = reader.flat_map(|batch| batch).collect();
println!("Read {} records from {} rows", all_records.len(), total_rows);
}
// Read in parallel across row groups
if let Some((total_rows, par_reader)) = parquet_record::read_parquet_par::<MyRecord>(
MyRecord::schema(),
"output.parquet",
Some(100)
) {
let all_records: Vec<MyRecord> = par_reader.flat_map(|batch| batch).collect();
println!("Read {} records in parallel", all_records.len());
}
use arrow::datatypes::Int32Type;
// Read only the "id" column
if let Some((total_rows, id_reader)) =
parquet_record::read_parquet_columns::<Int32Type>(
"output.parquet",
"id",
Some(100)
) {
let all_ids: Vec<i32> = id_reader.flat_map(|batch| batch).collect();
println!("Read {} IDs", all_ids.len());
}
use parquet_record::ParquetRecordConfig;
let config = ParquetRecordConfig::with_verbose(false);
let writer = ParquetBatchWriter::<MyRecord>::with_config(
"output.parquet".to_string(),
Some(100),
config
);
ParquetRecord Trait
This trait must be implemented for any struct that you want to serialize to or deserialize from Parquet files.
Required Methods:
schema() -> Arc<Schema>: Returns the Arrow schema for this record type.
items_to_records(schema: Arc<Schema>, items: &[Self]) -> RecordBatch: Converts a slice of items to an Arrow RecordBatch.
records_to_items(record_batch: &RecordBatch) -> io::Result<Vec<Self>>: Converts an Arrow RecordBatch back to a vector of items.
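For reference, a minimal sketch of the trait shape implied by the method list above; the actual definition in the crate may carry additional bounds or provided methods:
use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;
use std::io;
use std::sync::Arc;

// Sketch only: the real trait lives in parquet-record, you implement it for your structs
pub trait ParquetRecord: Sized {
    fn schema() -> Arc<Schema>;
    fn items_to_records(schema: Arc<Schema>, items: &[Self]) -> RecordBatch;
    fn records_to_items(record_batch: &RecordBatch) -> io::Result<Vec<Self>>;
}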
ParquetRecordConfig Struct
Configuration options for parquet operations.
Methods:
with_verbose(verbose: bool) -> Self: Creates a configuration with specified verbose setting.
silent() -> Self: Creates a configuration with verbose output disabled.
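For example, the configuration step from the earlier example can use silent() instead of with_verbose(false), which per the descriptions above disables verbose output:
// Disable verbose output for all operations performed through this writer
let config = ParquetRecordConfig::silent();
let writer = ParquetBatchWriter::<MyRecord>::with_config(
    "output.parquet".to_string(),
    Some(100),
    config,
);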
ParquetBatchWriter<T> Struct
A thread-safe batch writer for writing records to Parquet files with buffering and automatic file management.
Methods:
new(output_file: String, buffer_size: Option<usize>) -> Self: Creates a new batch writer with default configuration.
with_config(output_file: String, buffer_size: Option<usize>, config: ParquetRecordConfig) -> Self: Creates a new batch writer with custom configuration.
add_items(&self, items: Vec<T>) -> Result<(), io::Error>: Adds multiple items to the buffer and writes when buffer is full.
add_item(&self, item: T) -> Result<(), io::Error>: Adds a single item to the buffer and writes when buffer is full.
flush(&self) -> Result<(), io::Error>: Forces writing of all buffered items to the file.
close(self) -> Result<(), io::Error>: Closes the writer and finalizes the Parquet file.
close_no_consume(&self) -> Result<(), io::Error>: Closes the writer without consuming it, unlike close(self).
get_stats(&self) -> Result<WriteStats, io::Error>: Returns current write statistics.
buffer_len(&self) -> usize: Returns the current number of items in the buffer.
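A short sketch exercising the single-item and inspection methods listed above; the stats field accessed here is the one documented under WriteStats below, and it is assumed to be publicly readable:
let writer = ParquetBatchWriter::<MyRecord>::new("single.parquet".to_string(), Some(10));
writer.add_item(MyRecord { id: 42, name: "Dave".to_string() }).unwrap();
println!("items currently buffered: {}", writer.buffer_len());
writer.flush().unwrap(); // force the buffered item out to disk
let stats = writer.get_stats().unwrap();
println!("total items written: {}", stats.total_items_written);
writer.close_no_consume().unwrap(); // finalize the file without consuming the writer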
WriteStats Struct
Contains statistics about the writing process.
Fields:
total_items_written: Total number of items written to the file
total_batches_written: Total number of batches written
total_bytes_written: Total bytes written to the file
Reading Functions:
read_parquet_with_config<T>: Sequential reading with custom configuration.
read_parquet<T>: Sequential reading with default configuration.
read_parquet_columns_with_config<I>: Read specific columns sequentially with custom configuration.
read_parquet_columns<I>: Read specific columns sequentially with default configuration.
read_parquet_with_config_par<T>: Parallel reading with custom configuration.
read_parquet_par<T>: Parallel reading with default configuration.
read_parquet_columns_with_config_par<I>: Parallel column reading with custom configuration.
read_parquet_columns_par<I>: Parallel column reading with default configuration.
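A hedged sketch of one _with_config variant. The parameter order shown (schema, path, batch size, then config) is an assumption extrapolated from read_parquet and the writer's with_config; check the crate's API docs for the actual signature:
use parquet_record::ParquetRecordConfig;

// Assumed signature: mirrors read_parquet with a trailing ParquetRecordConfig argument
if let Some((total_rows, reader)) = parquet_record::read_parquet_with_config(
    MyRecord::schema(),
    "output.parquet",
    Some(50),
    ParquetRecordConfig::silent(),
) {
    let records: Vec<MyRecord> = reader.flat_map(|batch| batch).collect();
    println!("Read {} records from {} rows without progress output", records.len(), total_rows);
}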
All writer operations are thread-safe and can be called from multiple threads simultaneously. The internal buffer is protected by a mutex, and when full, secondary buffers are swapped and written concurrently to maintain performance.
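A minimal sketch of concurrent writes under that guarantee, assuming ParquetBatchWriter<MyRecord> is Sync so scoped threads can share a reference to it:
let writer = ParquetBatchWriter::<MyRecord>::new("threads.parquet".to_string(), Some(100));
std::thread::scope(|s| {
    for t in 0..4 {
        let writer = &writer;
        s.spawn(move || {
            let items: Vec<MyRecord> = (0..25)
                .map(|i| MyRecord { id: t * 25 + i, name: format!("worker-{t}-row-{i}") })
                .collect();
            // add_items takes &self, so every thread can write through the shared reference
            writer.add_items(items).unwrap();
        });
    }
});
writer.close().unwrap();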
The library is designed around Arrow's memory layout for optimal performance.
This library uses a simple GitHub Action to publish to crates.io:
The action reads the version from Cargo.toml and attempts to publish that version.
Publishing requires a CARGO_REGISTRY_TOKEN secret in the GitHub repository settings.
Bump the version in Cargo.toml before merging to main to trigger a new release.
This project is licensed under the MIT License - see the LICENSE file for details.