nc2parquet

A high-performance Rust library and CLI tool for converting NetCDF files to Parquet format with advanced filtering, cloud storage, and post-processing capabilities.

Features

High Performance: Built in Rust with efficient processing of large NetCDF datasets
Advanced Filtering: Multiple filter types with intersection logic for precise data extraction
Cloud Storage: Native Amazon S3 support for input and output files with async operations
Multi-Source Configuration: CLI arguments, environment variables, and JSON/YAML configuration files
Post-Processing Framework: Built-in DataFrame transformations including column renaming, unit conversion, and formula application
Professional CLI: Comprehensive command-line interface with progress indicators, logging, and shell completions

Installation

Command-Line Tool

# Install from source
cargo install --path .

# Or install from crates.io
cargo install nc2parquet

Library Dependency

Add to your Cargo.toml:

[dependencies]
nc2parquet = "0.1.1"

Quick Start

Command-Line Interface

The CLI provides comprehensive functionality with multiple subcommands:

# Basic conversion
nc2parquet convert input.nc output.parquet --variable temperature

# S3 to S3 conversion
nc2parquet convert s3://input-bucket/data.nc s3://output-bucket/result.parquet --variable pressure

# Conversion with filtering
nc2parquet convert data.nc result.parquet \
  --variable temperature \
  --range "latitude:30:60" \
  --list "pressure:1000,850,500"

# Conversion with post-processing
nc2parquet convert data.nc result.parquet \
  --variable temperature \
  --rename "temperature:temp_k" \
  --kelvin-to-celsius temp_k \
  --formula "temp_f:temp_k*1.8+32"

# Generate configuration templates
nc2parquet template basic -o config.json
nc2parquet template s3 --format yaml -o s3-config.yaml

# Validate configurations
nc2parquet validate config.json --detailed

# File information and inspection
nc2parquet info data.nc                           # Basic file info (human-readable)
nc2parquet info data.nc --detailed                # Include global attributes
nc2parquet info data.nc --variable temperature    # Show specific variable info
nc2parquet info data.nc --format json             # JSON output for scripting
nc2parquet info data.nc --format yaml             # YAML output
nc2parquet info data.nc --format csv              # CSV output (variables table)
nc2parquet info s3://bucket/data.nc --detailed    # Works with S3 files too

# Generate shell completions
nc2parquet completions bash > ~/.bash_completion.d/nc2parquet

Library Usage

Basic Conversion:

use nc2parquet::{JobConfig, process_netcdf_job_async};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = JobConfig::from_json(r#"
    {
        "nc_key": "data/temperature.nc",
        "variable_name": "temperature",
        "parquet_key": "output/temperature.parquet",
        "filters": [
            {
                "kind": "range",
                "params": {
                    "dimension_name": "latitude",
                    "min_value": 30.0,
                    "max_value": 60.0
                }
            }
        ]
    }
    "#)?;

    process_netcdf_job_async(&config).await?;
    Ok(())
}

S3 and Post-Processing:

use nc2parquet::{JobConfig, process_netcdf_job_async};
use nc2parquet::postprocess::*;
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = JobConfig {
        nc_key: "s3://my-bucket/weather-data.nc".to_string(),
        variable_name: "temperature".to_string(),
        parquet_key: "s3://output-bucket/processed-temp.parquet".to_string(),
        filters: vec![
            // Spatial filter for North America
            FilterConfig::Range {
                params: RangeParams {
                    dimension_name: "latitude".to_string(),
                    min_value: 25.0,
                    max_value: 70.0,
                }
            },
            FilterConfig::Range {
                params: RangeParams {
                    dimension_name: "longitude".to_string(),
                    min_value: -140.0,
                    max_value: -60.0,
                }
            },
        ],
        postprocessing: Some(ProcessingPipelineConfig {
            name: Some("Weather Data Processing".to_string()),
            processors: vec![
                // Rename columns
                ProcessorConfig::RenameColumns {
                    mappings: {
                        let mut map = HashMap::new();
                        map.insert("temperature".to_string(), "temp_kelvin".to_string());
                        map
                    },
                },
                // Convert Kelvin to Celsius
                ProcessorConfig::UnitConvert {
                    column: "temp_kelvin".to_string(),
                    from_unit: "kelvin".to_string(),
                    to_unit: "celsius".to_string(),
                },
                // Add a computed Fahrenheit column (temp_kelvin holds
                // Celsius values after the conversion above)
                ProcessorConfig::ApplyFormula {
                    target_column: "temp_fahrenheit".to_string(),
                    formula: "temp_kelvin * 1.8 + 32".to_string(),
                    source_columns: vec!["temp_kelvin".to_string()],
                },
            ],
        }),
    };

    process_netcdf_job_async(&config).await?;
    Ok(())
}

File Information and Inspection

The info subcommand provides comprehensive NetCDF file analysis capabilities:

Basic Usage

# Display file structure and metadata
nc2parquet info temperature_data.nc

Output:

NetCDF File Information:
  Path: temperature_data.nc
  File Size: 2.72 KB
  Dimensions: 4 total
    level (2)
    latitude (6)
    longitude (12)
    time (2, unlimited)
  Variables: 4 total
    latitude (Float(F32)) - dimensions: [latitude]
      @units: Str("degrees_north")
    longitude (Float(F32)) - dimensions: [longitude]
      @units: Str("degrees_east")
    pressure (Float(F32)) - dimensions: [time, level, latitude, longitude]
      @units: Str("hPa")
    temperature (Float(F32)) - dimensions: [time, level, latitude, longitude]
      @units: Str("celsius")

Advanced Features

Detailed Information:

# Include global attributes and extended metadata
nc2parquet info data.nc --detailed

Variable-Specific Analysis:

# Focus on a specific variable
nc2parquet info ocean_data.nc --variable sea_surface_temperature

Multiple Output Formats:

# JSON format for programmatic use
nc2parquet info data.nc --format json > file_info.json

# YAML format for human-readable structured output
nc2parquet info data.nc --format yaml

# CSV format for variable analysis (tabular data)
nc2parquet info data.nc --format csv > variables.csv

Cloud Storage Support:

# Analyze S3-hosted NetCDF files directly
nc2parquet info s3://climate-data/global_temperature.nc --detailed

JSON Output Structure

The JSON output provides a complete machine-readable representation:

{
  "path": "temperature_data.nc",
  "file_size": 2784,
  "total_dimensions": 4,
  "total_variables": 4,
  "dimensions": [
    {
      "name": "level",
      "length": 2,
      "is_unlimited": false
    },
    {
      "name": "time",
      "length": 2,
      "is_unlimited": true
    }
  ],
  "variables": [
    {
      "name": "temperature",
      "data_type": "Float(F32)",
      "dimensions": ["time", "level", "latitude", "longitude"],
      "shape": [2, 2, 6, 12],
      "attributes": {
        "units": "Str(\"celsius\")"
      }
    }
  ],
  "global_attributes": {}
}
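
Because the JSON is machine-readable, it is easy to consume from downstream tooling. A minimal sketch (assuming the serde_json crate and the file_info.json produced by the redirect example above) that lists each variable and its dimensions:

use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse the output of `nc2parquet info data.nc --format json > file_info.json`.
    let raw = std::fs::read_to_string("file_info.json")?;
    let info: Value = serde_json::from_str(&raw)?;

    // Walk the "variables" array shown in the structure above.
    if let Some(vars) = info["variables"].as_array() {
        for var in vars {
            println!("{} -> {}", var["name"], var["dimensions"]);
        }
    }
    Ok(())
}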

Storage Support

nc2parquet supports both local filesystem and Amazon S3 storage:

Local Files

{
  "nc_key": "/path/to/input.nc",
  "parquet_key": "/path/to/output.parquet"
}

Amazon S3

{
  "nc_key": "s3://my-bucket/path/to/input.nc",
  "parquet_key": "s3://my-bucket/path/to/output.parquet"
}

Mixed Storage

{
  "nc_key": "s3://input-bucket/data.nc",
  "parquet_key": "/local/path/output.parquet"
}

AWS Configuration

For S3 support, configure AWS credentials using any of these methods:

Environment Variables

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

AWS Credentials File

# ~/.aws/credentials
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key

# ~/.aws/config
[default]
region = us-east-1

IAM Roles

When running on AWS infrastructure (EC2, Lambda, ECS), IAM roles are automatically used.

Filter Types

nc2parquet supports four types of filters that can be combined for precise data extraction:

1. Range Filter

Selects values within a numeric range:

{
  "kind": "range",
  "params": {
    "dimension_name": "temperature",
    "min_value": -10.0,
    "max_value": 35.0
  }
}

2. List Filter

Selects specific discrete values:

{
  "kind": "list",
  "params": {
    "dimension_name": "pressure_level",
    "values": [850.0, 500.0, 200.0]
  }
}

3. 2D Point Filter

Selects spatial coordinates with tolerance:

{
  "kind": "2d_point",
  "params": {
    "lat_dimension_name": "latitude",
    "lon_dimension_name": "longitude",
    "points": [
      [40.7, -74.0],
      [51.5, -0.1]
    ],
    "tolerance": 0.1
  }
}

4. 3D Point Filter

Selects spatiotemporal coordinates:

{
  "kind": "3d_point",
  "params": {
    "time_dimension_name": "time",
    "lat_dimension_name": "latitude",
    "lon_dimension_name": "longitude",
    "steps": [0.0, 6.0, 12.0],
    "points": [
      [40.7, -74.0],
      [51.5, -0.1]
    ],
    "tolerance": 0.1
  }
}
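
For intuition, the 2D and 3D point filters match a coordinate when it falls within the tolerance of every listed axis for at least one point. A minimal sketch of that matching logic (purely illustrative; the actual filter implementation is internal to nc2parquet and may differ, e.g. in its distance metric):

/// Hypothetical helper: per-axis tolerance matching against a point list.
fn matches_2d_point(lat: f64, lon: f64, points: &[(f64, f64)], tolerance: f64) -> bool {
    points
        .iter()
        .any(|&(p_lat, p_lon)| (lat - p_lat).abs() <= tolerance && (lon - p_lon).abs() <= tolerance)
}

fn main() {
    let points = [(40.7, -74.0), (51.5, -0.1)];
    assert!(matches_2d_point(40.75, -74.05, &points, 0.1));
    assert!(!matches_2d_point(45.0, -80.0, &points, 0.1));
}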

Configuration Examples

Simple Weather Data Extraction

{
  "nc_key": "weather_data.nc",
  "variable_name": "temperature",
  "parquet_key": "temperature_filtered.parquet",
  "filters": [
    {
      "kind": "range",
      "params": {
        "dimension_name": "latitude",
        "min_value": 30.0,
        "max_value": 45.0
      }
    }
  ]
}

Multi-Filter Climate Analysis

{
  "nc_key": "s3://climate-data/global_temps.nc",
  "variable_name": "temperature",
  "parquet_key": "s3://results/urban_temps.parquet",
  "filters": [
    {
      "kind": "range",
      "params": {
        "dimension_name": "time",
        "min_value": 20200101.0,
        "max_value": 20231231.0
      }
    },
    {
      "kind": "2d_point",
      "params": {
        "lat_dimension_name": "latitude",
        "lon_dimension_name": "longitude",
        "points": [
          [40.7128, -74.006],
          [34.0522, -118.2437],
          [41.8781, -87.6298]
        ],
        "tolerance": 0.5
      }
    }
  ]
}

Ocean Data Processing

{
  "nc_key": "s3://ocean-data/sst_2023.nc",
  "variable_name": "sea_surface_temperature",
  "parquet_key": "atlantic_sst.parquet",
  "filters": [
    {
      "kind": "range",
      "params": {
        "dimension_name": "longitude",
        "min_value": -80.0,
        "max_value": -10.0
      }
    },
    {
      "kind": "range",
      "params": {
        "dimension_name": "latitude",
        "min_value": 0.0,
        "max_value": 70.0
      }
    },
    {
      "kind": "list",
      "params": {
        "dimension_name": "depth",
        "values": [0.0, 5.0, 10.0]
      }
    }
  ]
}

Performance Tips

  1. Use S3 Transfer Acceleration for faster uploads to S3
  2. Apply filters early to reduce data transfer and processing time
  3. Use specific coordinates rather than large ranges when possible
  4. Consider data locality - process data in the same AWS region as your S3 buckets

Error Handling

The library provides detailed error messages for common issues; a minimal handling sketch follows the list:

  • File not found: Clear indication of missing input files (local or S3)
  • Invalid NetCDF: Detailed validation errors for malformed files
  • Permission errors: Specific AWS permission or filesystem access issues
  • Configuration errors: JSON parsing and validation errors with context
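
All of these surface through the job API used in the library examples above. A minimal sketch, assuming errors arrive as the same boxed std::error::Error style shown in those examples:

use nc2parquet::{JobConfig, process_netcdf_job_async};

#[tokio::main]
async fn main() {
    // Configuration errors (malformed JSON, missing fields) surface at parse time.
    let config = JobConfig::from_json(r#"
    {
        "nc_key": "missing.nc",
        "variable_name": "temperature",
        "parquet_key": "out.parquet",
        "filters": []
    }
    "#)
    .expect("invalid configuration");

    // File-not-found, permission, and NetCDF validation errors surface at run time.
    if let Err(err) = process_netcdf_job_async(&config).await {
        eprintln!("conversion failed: {err}");
    }
}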

Testing

Run the full test suite, including the S3 integration tests against public datasets:

cargo test

Run only the library unit tests:

cargo test --lib

The S3 integration tests use public NOAA Climate Data Records and require no AWS credentials or configuration; network connectivity issues are handled gracefully.

Available S3 Tests:

  • test_public_s3_noaa_dataset_pipeline - Full pipeline test using NOAA Total Solar Irradiance data
  • test_noaa_s3_info_command - Info command test with NOAA public dataset
  • test_s3_storage_noaa_public_dataset - Storage layer test with public S3 data
  • test_storage_factory_noaa_public_dataset - Storage factory test with public data

To run only S3-related tests:

cargo test noaa -- --nocapture

Configuration Sources

nc2parquet supports multiple configuration sources with clear precedence:

Priority (highest to lowest; a resolution sketch follows the list):

  1. CLI arguments
  2. Environment variables
  3. Configuration files
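
As an illustration, resolving a setting amounts to taking the first value found while walking the sources from highest to lowest priority. A hypothetical sketch (not nc2parquet's actual internals):

/// Hypothetical resolver: CLI beats environment, environment beats config file.
fn resolve(cli: Option<&str>, env: Option<&str>, file: Option<&str>) -> Option<String> {
    cli.or(env).or(file).map(str::to_string)
}

fn main() {
    assert_eq!(
        resolve(Some("cli.nc"), Some("env.nc"), Some("file.nc")).as_deref(),
        Some("cli.nc")
    );
    assert_eq!(resolve(None, Some("env.nc"), Some("file.nc")).as_deref(), Some("env.nc"));
    assert_eq!(resolve(None, None, Some("file.nc")).as_deref(), Some("file.nc"));
}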

Environment Variables

All CLI options can be set via environment variables with the NC2PARQUET_ prefix:

# Core configuration
export NC2PARQUET_INPUT="s3://my-bucket/data.nc"
export NC2PARQUET_OUTPUT="s3://output-bucket/result.parquet"
export NC2PARQUET_VARIABLE="temperature"
export NC2PARQUET_CONFIG="/path/to/config.json"

# Processing options
export NC2PARQUET_FORCE=true
export NC2PARQUET_DRY_RUN=true

# Filter configuration
export NC2PARQUET_RANGE_FILTERS="lat:30:60,lon:-120:-80"
export NC2PARQUET_LIST_FILTERS="pressure:1000,850,500;level:1,2,3"
export NC2PARQUET_POINT2D_FILTERS="lat,lon:40.7,-74.0:0.5"
export NC2PARQUET_POINT3D_FILTERS="time,lat,lon:0.0,40.7,-74.0:0.1"

# Override paths for specific scenarios
export NC2PARQUET_INPUT_OVERRIDE="/alternative/input.nc"
export NC2PARQUET_OUTPUT_OVERRIDE="/alternative/output.parquet"

Configuration Files

Both JSON and YAML configuration files are supported, with automatic format detection:

nc2parquet convert --config config.json
nc2parquet convert --config config.yaml
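
For reference, a YAML configuration mirrors the JSON structure shown earlier; here is the "Simple Weather Data Extraction" example rewritten in YAML (assuming the same field names apply in both formats):

nc_key: weather_data.nc
variable_name: temperature
parquet_key: temperature_filtered.parquet
filters:
  - kind: range
    params:
      dimension_name: latitude
      min_value: 30.0
      max_value: 45.0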

Post-Processing Framework

Transform DataFrames after extraction with built-in processors:

Available Processors

  1. Column Renaming

    --rename "old_name:new_name,temperature:temp_k"
    
  2. Unit Conversion (see the arithmetic sketch after this list)

    --unit-convert "temperature:kelvin:celsius"
    --kelvin-to-celsius temperature  # Shortcut for Kelvin→Celsius
    
  3. Formula Application

    --formula "temp_f:temp_c * 1.8 + 32"
    --formula "heat_index:temp + humidity * 0.1"
    
  4. DateTime Conversion (configuration only)

  5. Data Aggregation (configuration only)
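
The arithmetic behind the temperature shortcuts is the standard conversion; a minimal sketch (not the crate's internal code):

fn kelvin_to_celsius(k: f64) -> f64 {
    // 0 °C corresponds to 273.15 K.
    k - 273.15
}

fn celsius_to_fahrenheit(c: f64) -> f64 {
    c * 1.8 + 32.0
}

fn main() {
    assert!((kelvin_to_celsius(300.0) - 26.85).abs() < 1e-9);
    assert!((celsius_to_fahrenheit(26.85) - 80.33).abs() < 1e-9);
}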

Post-Processing Configuration

{
  "nc_key": "weather.nc",
  "variable_name": "temperature",
  "parquet_key": "processed.parquet",
  "postprocessing": {
    "name": "Weather Data Pipeline",
    "processors": [
      {
        "type": "rename_columns",
        "mappings": {
          "temperature": "temp_k",
          "lat": "latitude",
          "lon": "longitude"
        }
      },
      {
        "type": "unit_convert",
        "column": "temp_k",
        "from_unit": "kelvin",
        "to_unit": "celsius"
      },
      {
        "type": "apply_formula",
        "target_column": "temp_fahrenheit",
        "formula": "temp_k * 1.8 - 459.67",
        "source_columns": ["temp_k"]
      },
      {
        "type": "datetime_convert",
        "column": "time",
        "base": "2000-01-01T00:00:00Z",
        "unit": "hours"
      }
    ]
  }
}

Pipeline Chaining

Processors are executed sequentially, allowing complex transformations:

nc2parquet convert weather.nc result.parquet \
  --variable temperature \
  --rename "temperature:temp_k,lat:latitude" \
  --kelvin-to-celsius temp_k \
  --formula "temp_f:temp_k*1.8+32" \
  --formula "heat_index:temp_k+humidity*0.05"

Architecture

nc2parquet uses a modular architecture; a sketch of the storage abstraction follows the list:

  • Storage Layer: Unified interface for local and S3 operations
  • Filter System: Composable filters with intersection logic
  • Processing Pipeline: Efficient async processing with minimal memory usage
  • Configuration: Type-safe JSON configuration with validation
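
One way to picture the storage layer is as a single trait with local and S3 implementations behind it. A hypothetical sketch (the crate's real abstraction is internal and async; this only conveys the shape):

use std::error::Error;

/// Hypothetical unified storage interface.
trait Storage {
    fn read(&self, key: &str) -> Result<Vec<u8>, Box<dyn Error>>;
    fn write(&self, key: &str, bytes: &[u8]) -> Result<(), Box<dyn Error>>;
}

/// Local-filesystem backend; an S3 backend would implement the same trait.
struct LocalStorage;

impl Storage for LocalStorage {
    fn read(&self, key: &str) -> Result<Vec<u8>, Box<dyn Error>> {
        Ok(std::fs::read(key)?)
    }
    fn write(&self, key: &str, bytes: &[u8]) -> Result<(), Box<dyn Error>> {
        Ok(std::fs::write(key, bytes)?)
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let store = LocalStorage;
    store.write("example.bin", b"hello")?;
    assert_eq!(store.read("example.bin")?, b"hello");
    Ok(())
}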

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Ensure tests pass: cargo test
  5. Submit a pull request

Roadmap

  • Sprint 2: CLI application with argument parsing and configuration management
  • Sprint 3: Advanced post-processing with data aggregation and statistics
  • Sprint 4: Performance optimizations and streaming processing
  • Sprint 5: Extended cloud support (GCS, Azure Blob Storage)
  • Sprint 6: Monitoring and logging improvements
  • Sprint 7: Advanced filtering and data transformation capabilities

License

This project is licensed under the MIT License - see the LICENSE file for details.

Public Dataset Integration

nc2parquet includes comprehensive tests using the NOAA Climate Data Record (CDR) for Total Solar Irradiance from AWS Open Data:

# Analyze NOAA Total Solar Irradiance data
nc2parquet info s3://noaa-cdr-total-solar-irradiance-pds/data/daily/tsi_v02r01_daily_s18820101_e18821231_c20170717.nc

# Convert NOAA data with filtering
nc2parquet convert \
  s3://noaa-cdr-total-solar-irradiance-pds/data/daily/tsi_v02r01_daily_s18820101_e18821231_c20170717.nc \
  solar_irradiance_1882.parquet \
  --variable TSI \
  --range "time:99346:99350"

This public dataset provides:

  • 365 daily measurements of Total Solar Irradiance for 1882
  • No AWS credentials required - publicly accessible
  • Real-world NetCDF-4/HDF5 format for testing
  • Scientific metadata including units, uncertainty values, and CF-compliant attributes

The integration demonstrates nc2parquet's capability to process real climate science data from cloud storage seamlessly.

Examples

The examples/ directory contains sample NetCDF files and configuration examples:

  • examples/data/simple_xy.nc: Simple 2D test data
  • examples/data/pres_temp_4D.nc: 4D weather data with time series
  • examples/configs/: Sample configuration files for various use cases