| Crates.io | midas_processor |
| lib.rs | midas_processor |
| version | 1.2.0 |
| created_at | 2025-07-17 16:53:27.259025+00 |
| updated_at | 2025-07-18 15:36:54.362328+00 |
| description | High-performance Rust tool for converting UK Met Office MIDAS weather datasets from BADC-CSV to optimized Parquet format |
| homepage | https://github.com/rjl-climate/midas_processor |
| repository | https://github.com/rjl-climate/midas_processor |
| max_upload_size | |
| id | 1757844 |
| size | 406,203 |
A high-performance Rust tool for converting UK Met Office MIDAS weather datasets from BADC-CSV format to optimized Parquet files for efficient analysis.
MIDAS Processor is part of a climate research toolkit designed to process historical UK weather data from the CEDA Archive. It transforms the original BADC-CSV format into modern, optimized Parquet files with significant performance improvements for analytical workloads.
This tool works as part of a complete climate data processing pipeline:
MIDAS (Met Office Integrated Data Archive System) contains historical weather observations from 1000+ UK land-based weather stations, spanning from the late 19th century to present day. The datasets include:
```bash
# Install from source
git clone https://github.com/rjl-climate/midas_processor
cd midas_processor
cargo install --path .

# Or install from crates.io
cargo install midas-processor
```
```bash
midas-processor
```

This will show available datasets and let you select one interactively.

```bash
# Interactive dataset selection
midas-processor

# Process specific dataset
midas-processor /path/to/uk-daily-rain-obs-202407

# Custom output location
midas-processor --output-path ./analysis/rain_data.parquet

# High compression for archival
midas-processor --compression zstd

# Schema analysis only (no conversion)
midas-processor --discovery-only --verbose

# Combine options
midas-processor /path/to/dataset --compression lz4 --verbose
```
| Option | Description | Default |
|---|---|---|
| `DATASET_PATH` | Path to MIDAS dataset directory (optional) | Auto-discover |
| `--output-path` | Custom output location | `../parquet/{dataset}.parquet` |
| `--compression` | Compression algorithm (snappy/zstd/lz4/none) | snappy |
| `--discovery-only` | Analyze schema without converting | false |
| `--verbose` | Enable detailed logging | false |
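
When converting several datasets from a script, it can help to assemble the command line programmatically. The sketch below uses only the options listed in the table above; `build_command` and `COMPRESSION_CHOICES` are hypothetical helper names for illustration and are not part of the tool itself.

```python
import shlex
from typing import List, Optional

# Compression values listed in the options table.
COMPRESSION_CHOICES = ("snappy", "zstd", "lz4", "none")


def build_command(
    dataset_path: Optional[str] = None,
    output_path: Optional[str] = None,
    compression: str = "snappy",
    discovery_only: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Build an argument list for midas-processor (hypothetical helper)."""
    if compression not in COMPRESSION_CHOICES:
        raise ValueError(f"unsupported compression: {compression!r}")

    cmd = ["midas-processor"]
    if dataset_path is not None:
        cmd.append(str(dataset_path))
    if output_path is not None:
        cmd += ["--output-path", str(output_path)]
    if compression != "snappy":  # snappy is the documented default, so omit it
        cmd += ["--compression", compression]
    if discovery_only:
        cmd.append("--discovery-only")
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "/path/to/uk-daily-rain-obs-202407", compression="zstd", verbose=True
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building the argument list as a Python list (rather than a single string) avoids shell-quoting problems when dataset paths contain spaces.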
- Station-Timestamp Sorting: Data is sorted by `station_id` then `ob_end_time` for optimal query performance
- Large Row Groups: 500K rows per group for better compression and fewer metadata operations
- Column Statistics: Enabled for all columns to allow query engines to skip irrelevant data
- Memory Streaming: Processes datasets larger than available RAM through streaming execution
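
To see why sorting and column statistics work together, consider the toy model below. It is a pure-Python illustration of the idea, not the tool's implementation: rows are sorted by `(station_id, ob_end_time)`, split into fixed-size groups, and each group records the minimum and maximum `station_id` it contains. A reader filtering on one station can then skip every group whose range excludes it. The function names and the tiny group size are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (station_id, ob_end_time, value)


def build_row_groups(rows: List[Row], group_size: int) -> List[Dict]:
    """Sort rows by (station_id, ob_end_time) and split them into row groups,
    recording min/max station_id per group as a stand-in for column statistics."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    groups = []
    for start in range(0, len(ordered), group_size):
        chunk = ordered[start:start + group_size]
        groups.append({
            "rows": chunk,
            # The chunk is sorted by station first, so its ends hold the extremes.
            "min_station": chunk[0][0],
            "max_station": chunk[-1][0],
        })
    return groups


def groups_to_scan(groups: List[Dict], station_id: str) -> List[int]:
    """Return indices of row groups whose statistics could contain station_id."""
    return [
        i for i, g in enumerate(groups)
        if g["min_station"] <= station_id <= g["max_station"]
    ]


if __name__ == "__main__":
    # Toy data: 4 stations x 3 observations, deliberately interleaved by day.
    rows = [
        (station, f"2023-01-0{day} 09:00:00", float(day))
        for day in (1, 2, 3)
        for station in ("00009", "00012", "00023", "00044")
    ]
    groups = build_row_groups(rows, group_size=3)
    print(groups_to_scan(groups, "00012"))  # → [1]
```

Only one of the four groups needs to be read for station `00012`. Had the rows been left interleaved, every group would span all stations and none could be skipped.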
```python
import polars as pl

# Fast station-based query
df = pl.scan_parquet("rain_data.parquet") \
    .filter(pl.col("station_id") == "00009") \
    .collect()

# Time range analysis
monthly_avg = pl.scan_parquet("temperature_data.parquet") \
    .filter(pl.col("ob_end_time").dt.year() == 2023) \
    .group_by(["station_id", pl.col("ob_end_time").dt.month()]) \
    .agg(pl.col("air_temperature").mean()) \
    .collect()
```

```python
import pandas as pd

# Read with automatic optimization
df = pd.read_parquet("rain_data.parquet")

# Station-specific analysis
station_data = df[df['station_id'] == '00009']
```

```r
library(arrow)
library(dplyr)  # provides %>%, filter(), group_by(), summarise(), collect()

# Lazy evaluation with Arrow
rain_data <- open_dataset("rain_data.parquet")

# Efficient aggregation
monthly_totals <- rain_data %>%
  filter(year(ob_end_time) == 2023) %>%
  group_by(station_id, month = month(ob_end_time)) %>%
  summarise(total_rain = sum(prcp_amt, na.rm = TRUE)) %>%
  collect()
```
Memory Issues

```bash
# For very large datasets, ensure sufficient RAM or use streaming
midas-processor --verbose  # Monitor memory usage
```

Performance Issues

```bash
# Check if storage is the bottleneck
midas-processor --verbose  # Shows processing rates
```

Cache Directory Not Found

```bash
# Ensure midas-fetcher has been run first
ls ~/Library/Application\ Support/midas-fetcher/cache/  # macOS
ls ~/.config/midas-fetcher/cache/                       # Linux
```
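
The same check can be scripted. The sketch below looks for the two cache locations named above and reports the first one that exists; `candidate_cache_dirs` and `find_cache_dir` are hypothetical helpers, and the assumption that no other locations are used comes only from the commands shown here.

```python
import sys
from pathlib import Path
from typing import List, Optional


def candidate_cache_dirs(home: Path, platform: str = sys.platform) -> List[Path]:
    """Cache locations from the troubleshooting notes, most likely first."""
    macos = home / "Library" / "Application Support" / "midas-fetcher" / "cache"
    linux = home / ".config" / "midas-fetcher" / "cache"
    return [macos, linux] if platform == "darwin" else [linux, macos]


def find_cache_dir(home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing cache directory, or None if none is found."""
    home = home or Path.home()
    for path in candidate_cache_dirs(home):
        if path.is_dir():
            return path
    return None


if __name__ == "__main__":
    found = find_cache_dir()
    if found is None:
        print("No midas-fetcher cache found; run midas-fetcher first.")
    else:
        print(f"Cache directory: {found}")
```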
This project is licensed under the MIT License - see the LICENSE file for details.
See CHANGELOG.md for detailed version history and release notes.
We welcome contributions! Please see our contributing guidelines for details.
```bash
git clone https://github.com/rjl-climate/midas_processor
cd midas_processor
cargo build
cargo test
```
- Run `cargo fmt` for formatting
- Ensure `cargo clippy` passes without warnings

If you use this tool in your research, please cite:
```bibtex
@software{midas_processor,
  title = {MIDAS Processor: High-Performance Climate Data Processing},
  author = {Richard Lyon},
  year = {2025},
  url = {https://github.com/rjl-climate/midas_processor}
}
```