| Crates.io | nc2parquet |
| lib.rs | nc2parquet |
| version | 0.1.1 |
| created_at | 2025-09-30 17:32:24.840623+00 |
| updated_at | 2025-10-02 19:06:12.619378+00 |
| description | High-performance NetCDF to Parquet converter with cloud storage support |
| homepage | https://github.com/rjmalves/nc2parquet |
| repository | https://github.com/rjmalves/nc2parquet |
| max_upload_size | |
| id | 1861430 |
| size | 448,640 |
A high-performance Rust library and CLI tool for converting NetCDF files to Parquet format with advanced filtering, cloud storage, and post-processing capabilities.
High Performance: Built in Rust with efficient processing of large NetCDF datasets
Advanced Filtering: Multiple filter types with intersection logic for precise data extraction
Cloud Storage: Native Amazon S3 support for input and output files with async operations
Multi-Source Configuration: CLI arguments, environment variables, and JSON/YAML configuration files
Post-Processing Framework: Built-in DataFrame transformations including column renaming, unit conversion, and formula application
Professional CLI: Comprehensive command-line interface with progress indicators, logging, and shell completions
# Install from source
cargo install --path .
# Or install from crates.io
cargo install nc2parquet
Add to your Cargo.toml:
[dependencies]
nc2parquet = "0.1.1"
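# The library examples below run under Tokio, so a runtime dependency is
# also needed. The feature flags here are an assumed minimal set — adjust
# to your project.
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }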
The CLI provides comprehensive functionality with multiple subcommands:
# Basic conversion
nc2parquet convert input.nc output.parquet --variable temperature
# S3 to S3 conversion
nc2parquet convert s3://input-bucket/data.nc s3://output-bucket/result.parquet --variable pressure
# Conversion with filtering
nc2parquet convert data.nc result.parquet \
--variable temperature \
--range "latitude:30:60" \
--list "pressure:1000,850,500"
# Conversion with post-processing
nc2parquet convert data.nc result.parquet \
--variable temperature \
--rename "temperature:temp_k" \
--kelvin-to-celsius temp_k \
--formula "temp_f:temp_k*1.8+32"
# Generate configuration templates
nc2parquet template basic -o config.json
nc2parquet template s3 --format yaml -o s3-config.yaml
# Validate configurations
nc2parquet validate config.json --detailed
# File information and inspection
nc2parquet info data.nc # Basic file info (human-readable)
nc2parquet info data.nc --detailed # Include global attributes
nc2parquet info data.nc --variable temperature # Show specific variable info
nc2parquet info data.nc --format json # JSON output for scripting
nc2parquet info data.nc --format yaml # YAML output
nc2parquet info data.nc --format csv # CSV output (variables table)
nc2parquet info s3://bucket/data.nc --detailed # Works with S3 files too
# Generate shell completions
nc2parquet completions bash > ~/.bash_completion.d/nc2parquet
Basic Conversion:
use nc2parquet::{JobConfig, process_netcdf_job_async};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = JobConfig::from_json(r#"
{
"nc_key": "data/temperature.nc",
"variable_name": "temperature",
"parquet_key": "output/temperature.parquet",
"filters": [
{
"kind": "range",
"params": {
"dimension_name": "latitude",
"min_value": 30.0,
"max_value": 60.0
}
}
]
}
"#)?;
process_netcdf_job_async(&config).await?;
Ok(())
}
S3 and Post-Processing:
use nc2parquet::{JobConfig, process_netcdf_job_async};
use nc2parquet::postprocess::*;
use std::collections::HashMap;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = JobConfig {
nc_key: "s3://my-bucket/weather-data.nc".to_string(),
variable_name: "temperature".to_string(),
parquet_key: "s3://output-bucket/processed-temp.parquet".to_string(),
filters: vec![
// Spatial filter for North America
FilterConfig::Range {
params: RangeParams {
dimension_name: "latitude".to_string(),
min_value: 25.0,
max_value: 70.0,
}
},
FilterConfig::Range {
params: RangeParams {
dimension_name: "longitude".to_string(),
min_value: -140.0,
max_value: -60.0,
}
},
],
postprocessing: Some(ProcessingPipelineConfig {
name: Some("Weather Data Processing".to_string()),
processors: vec![
// Rename columns
ProcessorConfig::RenameColumns {
mappings: {
let mut map = HashMap::new();
map.insert("temperature".to_string(), "temp_kelvin".to_string());
map
},
},
// Add a computed Fahrenheit column first, while temp_kelvin still
// holds Kelvin values (processors run sequentially)
ProcessorConfig::ApplyFormula {
    target_column: "temp_fahrenheit".to_string(),
    formula: "temp_kelvin * 1.8 - 459.67".to_string(),
    source_columns: vec!["temp_kelvin".to_string()],
},
// Then convert temp_kelvin from Kelvin to Celsius in place
ProcessorConfig::UnitConvert {
    column: "temp_kelvin".to_string(),
    from_unit: "kelvin".to_string(),
    to_unit: "celsius".to_string(),
},
],
}),
};
process_netcdf_job_async(&config).await?;
Ok(())
}
The info subcommand provides comprehensive NetCDF file analysis capabilities:
# Display file structure and metadata
nc2parquet info temperature_data.nc
Output:
NetCDF File Information:
Path: temperature_data.nc
File Size: 2.73 MB
Dimensions: 4 total
level (2)
latitude (6)
longitude (12)
time (2, unlimited)
Variables: 4 total
latitude (Float(F32)) - dimensions: [latitude]
@units: Str("degrees_north")
longitude (Float(F32)) - dimensions: [longitude]
@units: Str("degrees_east")
pressure (Float(F32)) - dimensions: [time, level, latitude, longitude]
@units: Str("hPa")
temperature (Float(F32)) - dimensions: [time, level, latitude, longitude]
@units: Str("celsius")
Detailed Information:
# Include global attributes and extended metadata
nc2parquet info data.nc --detailed
Variable-Specific Analysis:
# Focus on a specific variable
nc2parquet info ocean_data.nc --variable sea_surface_temperature
Multiple Output Formats:
# JSON format for programmatic use
nc2parquet info data.nc --format json > file_info.json
# YAML format for human-readable structured output
nc2parquet info data.nc --format yaml
# CSV format for variable analysis (tabular data)
nc2parquet info data.nc --format csv > variables.csv
Cloud Storage Support:
# Analyze S3-hosted NetCDF files directly
nc2parquet info s3://climate-data/global_temperature.nc --detailed
The JSON output provides a complete machine-readable representation:
{
"path": "temperature_data.nc",
"file_size": 2784,
"total_dimensions": 4,
"total_variables": 4,
"dimensions": [
{
"name": "level",
"length": 2,
"is_unlimited": false
},
{
"name": "time",
"length": 2,
"is_unlimited": true
}
],
"variables": [
{
"name": "temperature",
"data_type": "Float(F32)",
"dimensions": ["time", "level", "latitude", "longitude"],
"shape": [2, 2, 6, 12],
"attributes": {
"units": "Str(\"celsius\")"
}
}
],
"global_attributes": {}
}
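Because the JSON output is machine-readable, it can feed scripts or downstream tools. A minimal Rust sketch deserializing it with serde and serde_json (these structs simply mirror the sample above; they are not types exported by nc2parquet, and the serde derive feature is required):

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct VariableInfo {
    name: String,
    data_type: String,
    dimensions: Vec<String>,
    shape: Vec<usize>,
}

#[derive(Debug, Deserialize)]
struct FileInfo {
    path: String,
    file_size: u64,
    variables: Vec<VariableInfo>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // e.g. produced by: nc2parquet info data.nc --format json > file_info.json
    let raw = std::fs::read_to_string("file_info.json")?;
    let info: FileInfo = serde_json::from_str(&raw)?;
    println!("{} ({} bytes)", info.path, info.file_size);
    for v in &info.variables {
        println!("  {} {:?} shape {:?}", v.name, v.dimensions, v.shape);
    }
    Ok(())
}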
nc2parquet supports both local filesystem and Amazon S3 storage:
Local filesystem:
{
"nc_key": "/path/to/input.nc",
"parquet_key": "/path/to/output.parquet"
}
Amazon S3:
{
"nc_key": "s3://my-bucket/path/to/input.nc",
"parquet_key": "s3://my-bucket/path/to/output.parquet"
}
Mixed (S3 input, local output):
{
"nc_key": "s3://input-bucket/data.nc",
"parquet_key": "/local/path/output.parquet"
}
For S3 support, configure AWS credentials using any of these methods:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
# ~/.aws/credentials
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
# ~/.aws/config
[default]
region = us-east-1
When running on AWS infrastructure (EC2, Lambda, ECS), IAM roles are automatically used.
nc2parquet supports four types of filters that can be combined for precise data extraction:
Selects values within a numeric range:
{
"kind": "range",
"params": {
"dimension_name": "temperature",
"min_value": -10.0,
"max_value": 35.0
}
}
Selects specific discrete values:
{
"kind": "list",
"params": {
"dimension_name": "pressure_level",
"values": [850.0, 500.0, 200.0]
}
}
Selects spatial coordinates with tolerance:
{
"kind": "2d_point",
"params": {
"lat_dimension_name": "latitude",
"lon_dimension_name": "longitude",
"points": [
[40.7, -74.0],
[51.5, -0.1]
],
"tolerance": 0.1
}
}
Selects spatiotemporal coordinates:
{
"kind": "3d_point",
"params": {
"time_dimension_name": "time",
"lat_dimension_name": "latitude",
"lon_dimension_name": "longitude",
"steps": [0.0, 6.0, 12.0],
"points": [
[40.7, -74.0],
[51.5, -0.1]
],
"tolerance": 0.1
}
}
Basic conversion with a single range filter:
{
"nc_key": "weather_data.nc",
"variable_name": "temperature",
"parquet_key": "temperature_filtered.parquet",
"filters": [
{
"kind": "range",
"params": {
"dimension_name": "latitude",
"min_value": 30.0,
"max_value": 45.0
}
}
]
}
Multi-filter job combining a time range with city locations:
{
"nc_key": "s3://climate-data/global_temps.nc",
"variable_name": "temperature",
"parquet_key": "s3://results/urban_temps.parquet",
"filters": [
{
"kind": "range",
"params": {
"dimension_name": "time",
"min_value": 20200101.0,
"max_value": 20231231.0
}
},
{
"kind": "2d_point",
"params": {
"lat_dimension_name": "latitude",
"lon_dimension_name": "longitude",
"points": [
[40.7128, -74.006],
[34.0522, -118.2437],
[41.8781, -87.6298]
],
"tolerance": 0.5
}
}
]
}
Atlantic sea-surface temperature with range and list filters:
{
"nc_key": "s3://ocean-data/sst_2023.nc",
"variable_name": "sea_surface_temperature",
"parquet_key": "atlantic_sst.parquet",
"filters": [
{
"kind": "range",
"params": {
"dimension_name": "longitude",
"min_value": -80.0,
"max_value": -10.0
}
},
{
"kind": "range",
"params": {
"dimension_name": "latitude",
"min_value": 0.0,
"max_value": 70.0
}
},
{
"kind": "list",
"params": {
"dimension_name": "depth",
"values": [0.0, 5.0, 10.0]
}
}
]
}
The library provides detailed error messages for common issues.
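All fallible entry points return a standard Result, so callers can inspect or log failures before exiting; a minimal sketch (only the error's Display output is relied on here):

use nc2parquet::{JobConfig, process_netcdf_job_async};

#[tokio::main]
async fn main() {
    let raw = std::fs::read_to_string("config.json").expect("cannot read config file");
    let config = match JobConfig::from_json(&raw) {
        Ok(cfg) => cfg,
        Err(err) => {
            eprintln!("invalid configuration: {err}");
            std::process::exit(2);
        }
    };
    if let Err(err) = process_netcdf_job_async(&config).await {
        // the Display output carries the detailed message
        eprintln!("conversion failed: {err}");
        std::process::exit(1);
    }
}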
Run the full test suite:
cargo test
Run all tests (including S3 tests with public datasets):
cargo test --lib
The S3 integration tests use public NOAA Climate Data Records and require no AWS credentials or configuration; transient network connectivity issues are handled gracefully.
Available S3 Tests:
test_public_s3_noaa_dataset_pipeline - Full pipeline test using NOAA Total Solar Irradiance data
test_noaa_s3_info_command - Info command test with NOAA public dataset
test_s3_storage_noaa_public_dataset - Storage layer test with public S3 data
test_storage_factory_noaa_public_dataset - Storage factory test with public data
To run only S3-related tests:
cargo test noaa -- --nocapture
nc2parquet supports multiple configuration sources with clear precedence:
Priority (highest to lowest):
1. CLI arguments
2. Environment variables
3. Configuration files (JSON/YAML)
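For example, a value passed on the command line wins over the corresponding environment variable (the variables themselves are described below):
export NC2PARQUET_VARIABLE=pressure
# --variable takes precedence, so temperature is extracted
nc2parquet convert data.nc output.parquet --variable temperature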
All CLI options can be set via environment variables with the NC2PARQUET_ prefix:
# Core configuration
export NC2PARQUET_INPUT="s3://my-bucket/data.nc"
export NC2PARQUET_OUTPUT="s3://output-bucket/result.parquet"
export NC2PARQUET_VARIABLE="temperature"
export NC2PARQUET_CONFIG="/path/to/config.json"
# Processing options
export NC2PARQUET_FORCE=true
export NC2PARQUET_DRY_RUN=true
# Filter configuration
export NC2PARQUET_RANGE_FILTERS="lat:30:60,lon:-120:-80"
export NC2PARQUET_LIST_FILTERS="pressure:1000,850,500;level:1,2,3"
export NC2PARQUET_POINT2D_FILTERS="lat,lon:40.7,-74.0:0.5"
export NC2PARQUET_POINT3D_FILTERS="time,lat,lon:0.0,40.7,-74.0:0.1"
# Override paths for specific scenarios
export NC2PARQUET_INPUT_OVERRIDE="/alternative/input.nc"
export NC2PARQUET_OUTPUT_OVERRIDE="/alternative/output.parquet"
Configuration files can be written in JSON or YAML, with the format detected automatically:
nc2parquet convert --config config.json
nc2parquet convert --config config.yaml
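A YAML configuration mirrors the JSON schema field for field; for instance, the basic range-filter job shown earlier could be written as (a hand-translated sketch, not output of nc2parquet template):

nc_key: weather_data.nc
variable_name: temperature
parquet_key: temperature_filtered.parquet
filters:
  - kind: range
    params:
      dimension_name: latitude
      min_value: 30.0
      max_value: 45.0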
Transform DataFrames after extraction with built-in processors:
Column Renaming
--rename "old_name:new_name,temperature:temp_k"
Unit Conversion
--unit-convert "temperature:kelvin:celsius"
--kelvin-to-celsius temperature # Shortcut for Kelvin→Celsius
Formula Application
--formula "temp_f:temp_c * 1.8 + 32"
--formula "heat_index:temp + humidity * 0.1"
DateTime Conversion (configuration only)
Data Aggregation (configuration only)
{
"nc_key": "weather.nc",
"variable_name": "temperature",
"parquet_key": "processed.parquet",
"postprocessing": {
"name": "Weather Data Pipeline",
"processors": [
{
"type": "rename_columns",
"mappings": {
"temperature": "temp_k",
"lat": "latitude",
"lon": "longitude"
}
},
{
"type": "apply_formula",
"target_column": "temp_fahrenheit",
"formula": "temp_k * 1.8 - 459.67",
"source_columns": ["temp_k"]
},
{
"type": "unit_convert",
"column": "temp_k",
"from_unit": "kelvin",
"to_unit": "celsius"
},
{
"type": "datetime_convert",
"column": "time",
"base": "2000-01-01T00:00:00Z",
"unit": "hours"
}
]
}
}
Processors are executed sequentially, allowing complex transformations:
nc2parquet convert weather.nc result.parquet \
--variable temperature \
--rename "temperature:temp_k,lat:latitude" \
--kelvin-to-celsius temp_k \
--formula "temp_f:temp_k*1.8+32" \
--formula "heat_index:temp_k+humidity*0.05"
nc2parquet uses a modular architecture.
To contribute, create a feature branch and make sure the test suite passes before opening a pull request:
git checkout -b feature/amazing-feature
cargo test
This project is licensed under the MIT License - see the LICENSE file for details.
nc2parquet includes comprehensive tests using the NOAA Climate Data Record (CDR) for Total Solar Irradiance from AWS Open Data:
# Analyze NOAA Total Solar Irradiance data
nc2parquet info s3://noaa-cdr-total-solar-irradiance-pds/data/daily/tsi_v02r01_daily_s18820101_e18821231_c20170717.nc
# Convert NOAA data with filtering
nc2parquet convert \
s3://noaa-cdr-total-solar-irradiance-pds/data/daily/tsi_v02r01_daily_s18820101_e18821231_c20170717.nc \
solar_irradiance_1882.parquet \
--variable TSI \
--range "time:99346:99350"
This public dataset provides daily total solar irradiance measurements dating back to 1882 and requires no AWS credentials to access. The integration demonstrates that nc2parquet can process real climate-science data directly from cloud storage.
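The same NOAA job can be run through the library API; a sketch mirroring the CLI conversion above, using the JSON config form shown earlier:

use nc2parquet::{JobConfig, process_netcdf_job_async};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = JobConfig::from_json(r#"
    {
        "nc_key": "s3://noaa-cdr-total-solar-irradiance-pds/data/daily/tsi_v02r01_daily_s18820101_e18821231_c20170717.nc",
        "variable_name": "TSI",
        "parquet_key": "solar_irradiance_1882.parquet",
        "filters": [
            {
                "kind": "range",
                "params": {
                    "dimension_name": "time",
                    "min_value": 99346.0,
                    "max_value": 99350.0
                }
            }
        ]
    }
    "#)?;
    process_netcdf_job_async(&config).await?;
    Ok(())
}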
The examples/ directory contains sample NetCDF files and configuration examples:
examples/data/simple_xy.nc: Simple 2D test data
examples/data/pres_temp_4D.nc: 4D weather data with time series
examples/configs/: Sample configuration files for various use cases
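A quick way to try the tool against the bundled data (inspect the file first, since the variable name used here is an assumption):
nc2parquet info examples/data/simple_xy.nc
nc2parquet convert examples/data/simple_xy.nc simple_xy.parquet --variable data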