| Crates.io | genesis-preflight |
| lib.rs | genesis-preflight |
| version | 0.1.0 |
| created_at | 2025-11-29 21:25:26.382063+00 |
| updated_at | 2025-11-29 21:25:26.382063+00 |
| description | A zero-dependency CLI tool for validating and documenting scientific datasets in preparation for DOE Genesis Mission ingestion |
| homepage | https://github.com/clay-good/genesis-preflight |
| repository | https://github.com/clay-good/genesis-preflight |
| id | 1957507 |
| size | 419,642 |
Prepare Your Scientific Data for the DOE Genesis Mission
Genesis Preflight is a zero-dependency command-line tool that validates datasets against FAIR principles (Findable, Accessible, Interoperable, Reusable) and generates the documentation required for AI-ready scientific data.
Genesis Preflight is dedicated to the advancement of American scientific research and to the success of the Department of Energy's Genesis Mission. This tool was built to serve the researchers, scientists, and data stewards who work tirelessly to expand humanity's understanding of the natural world and to solve the critical challenges facing our nation.
Scientific data is the foundation of discovery. But too often, valuable datasets remain inaccessible, poorly documented, or incompatible with modern AI-driven research methods. The Genesis Mission aims to change this by creating a comprehensive platform for AI-ready scientific data. Genesis Preflight exists to help researchers prepare their data for this important mission.
This tool is freely available open source software, released under the MIT License. It belongs to the American people and to the global scientific community. There are no paywalls, no subscription fees, no vendor lock-in. Every researcher, from graduate students to principal investigators at national laboratories, can use this tool without restriction.
Genesis Preflight scans a dataset directory, scores its compliance, and can generate the documentation files the dataset is missing.
# Install from crates.io
cargo install genesis-preflight
# Or build from source
git clone https://github.com/clay-good/genesis-preflight
cd genesis-preflight
cargo build --release
./target/release/genesis-preflight --version
Validate an existing dataset:
genesis-preflight scan ./my-dataset
You'll receive a compliance score (0-100) and a list of issues to fix.
Create required metadata files automatically:
genesis-preflight generate ./my-dataset
This creates:
README.md - Human-readable overview
metadata.json - Machine-readable metadata
DATACARD.md - Provenance documentation
MANIFEST.txt - SHA-256 file hashes
*.schema.json - Data structure definitions (for CSV files)
Edit the generated files and replace all [TODO] markers with your dataset details, then scan again:
genesis-preflight scan ./my-dataset
Aim for a score of 80 or higher.
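Before re-scanning, it can help to confirm that no [TODO] markers are left in the generated files. The following is a minimal sketch of such a check, not part of Genesis Preflight itself; the helper names `todo_lines` and `check_file` are hypothetical, and the file names and the ./my-dataset path are taken from the steps above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Return the 1-based line numbers in `text` that still contain a `[TODO]` marker.
fn todo_lines(text: &str) -> Vec<usize> {
    text.lines()
        .enumerate()
        .filter(|(_, line)| line.contains("[TODO]"))
        .map(|(i, _)| i + 1)
        .collect()
}

/// Read one generated file and report the lines that still need editing.
fn check_file(path: &Path) -> io::Result<Vec<usize>> {
    Ok(todo_lines(&fs::read_to_string(path)?))
}

fn main() {
    // Generated file names come from the list above; the directory is a placeholder.
    for name in ["README.md", "metadata.json", "DATACARD.md"] {
        let path = Path::new("./my-dataset").join(name);
        match check_file(&path) {
            Ok(lines) if lines.is_empty() => println!("{}: no markers left", path.display()),
            Ok(lines) => println!("{}: [TODO] on lines {:?}", path.display(), lines),
            Err(e) => eprintln!("{}: {}", path.display(), e),
        }
    }
}
```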
# Quick scan without hashing (fast)
genesis-preflight scan ./dataset --no-hash
# Verbose output for debugging
genesis-preflight scan ./dataset --verbose
# Quiet mode (errors only)
genesis-preflight scan ./dataset --quiet
# Generate in dataset directory (default)
genesis-preflight generate ./dataset
# Generate in custom location
genesis-preflight generate ./dataset --output-dir ./docs
# Generate with verbose output
genesis-preflight generate ./dataset --verbose
# JSON output for CI/CD
genesis-preflight report ./dataset --json
# Save to file
genesis-preflight report ./dataset --json > compliance-report.json
# Extract specific data
genesis-preflight report ./dataset --json | jq '.score.total'
# GitHub Actions example
- name: Validate Dataset
  run: |
    genesis-preflight report ./data --json > report.json
    score=$(jq '.score.total' report.json)
    [ "$score" -ge 80 ] || exit 1
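A pipeline can also gate on the process exit status instead of parsing JSON. The sketch below assumes the exit codes documented in the reference section (0, 1, 2) and that `genesis-preflight` is on the PATH; `Verdict`, `classify`, and `run_scan` are hypothetical names introduced for illustration.

```rust
use std::io;
use std::process::Command;

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,     // exit code 0: no issues, score >= 80
    Warnings, // exit code 1: warnings present or score < 80
    Critical, // exit code 2: critical issues or score < 50
    Unknown,  // any other code, or termination by a signal
}

/// Map a raw exit code onto the documented meanings.
fn classify(code: Option<i32>) -> Verdict {
    match code {
        Some(0) => Verdict::Pass,
        Some(1) => Verdict::Warnings,
        Some(2) => Verdict::Critical,
        _ => Verdict::Unknown,
    }
}

/// Run `genesis-preflight scan <path> --quiet` and return its exit code.
/// `ExitStatus::code` is `None` when the process was killed by a signal.
fn run_scan(path: &str) -> io::Result<Option<i32>> {
    let status = Command::new("genesis-preflight")
        .args(["scan", path, "--quiet"])
        .status()?;
    Ok(status.code())
}

fn main() {
    match run_scan("./data") {
        Ok(code) => println!("verdict: {:?}", classify(code)),
        Err(e) => eprintln!("could not start genesis-preflight: {}", e),
    }
}
```

Whether `Warnings` should fail a build is a policy decision for each project; the sketch only reports the verdict.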
Commands:
scan <path> - Scan and validate a dataset
generate <path> - Scan, validate, and generate documentation
report <path> - Generate detailed compliance report
Flags:
-o, --output-dir <dir> - Directory for generated files (default: dataset root)
-v, --verbose - Show detailed progress information
-q, --quiet - Suppress all non-error output
--no-hash - Skip SHA-256 hashing for faster scanning
--json - Output report in JSON format (report command only)
-h, --help - Print help message
-V, --version - Print version information
Exit Codes:
0 - No issues, score >= 80
1 - Warnings present or score < 80
2 - Critical issues or score < 50
This tool is built exclusively with the Rust standard library, with zero external dependencies. This deliberate choice ensures the software can be audited, trusted, and used in sensitive research environments, including those handling data subject to export controls or classification review.
Transparency: Every algorithm is documented. Every decision is explainable. No black boxes.
Security: No supply chain attacks through compromised dependencies. No network access. No data modification. No telemetry.
Reliability: No version conflicts or dependency resolution issues. The tool will continue to work regardless of external package availability.
Accessibility: Clear, professional output with actionable guidance, not just error messages.
Respect: Automates tedious documentation tasks while never transmitting or exfiltrating information.
Trust: Researchers can verify every line of code that touches their data within this single repository.
Genesis Preflight calculates a compliance score (0-100) based on dataset quality. Each FAIR principle is scored separately (0-25 points each), and issues deduct points according to their severity:
Critical: -20 points each
Warning: -5 points each
Info: -1 point each
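The text does not spell out how the four sub-scores and the severity deductions combine, so the arithmetic below is only one plausible reading. It assumes the total is the sum of the FAIR sub-scores minus the deductions, clamped to 0-100. `Issues`, `total_score`, and `exit_code` are hypothetical names. The exit-code thresholds follow the Exit Codes list, and treating Info issues as not affecting the exit code is a further assumption.

```rust
/// Issue counts by severity.
struct Issues {
    critical: u32,
    warning: u32,
    info: u32,
}

/// ASSUMPTION: total = sum of the four FAIR sub-scores (each capped at 25)
/// minus the severity deductions, clamped to the range 0..=100.
fn total_score(fair: [u32; 4], issues: &Issues) -> u32 {
    let base: u32 = fair.iter().map(|&s| s.min(25)).sum();
    let penalty = issues.critical * 20 + issues.warning * 5 + issues.info;
    base.saturating_sub(penalty).min(100)
}

/// Thresholds from the Exit Codes list: 2 for critical issues or score < 50,
/// 1 for warnings or score < 80, otherwise 0.
/// ASSUMPTION: Info issues alone do not change the exit code.
fn exit_code(score: u32, issues: &Issues) -> i32 {
    if issues.critical > 0 || score < 50 {
        2
    } else if issues.warning > 0 || score < 80 {
        1
    } else {
        0
    }
}

fn main() {
    // Made-up sub-scores and issue counts, for illustration only.
    let issues = Issues { critical: 0, warning: 2, info: 3 };
    let score = total_score([25, 22, 20, 24], &issues);
    // 91 base points - (2 * 5 + 3 * 1) = 78, which falls in the "warnings" band.
    println!("score = {}, exit code = {}", score, exit_code(score, &issues));
}
```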
Comprehensive documentation is available in the docs/ directory.
Unlike tools that sample only the first few rows, Genesis Preflight analyzes every row of your CSV files using memory-efficient streaming. This ensures accurate type inference even for large datasets.
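To illustrate why a full pass matters, here is a minimal sketch of streaming type inference: it reads one line at a time, splits it with a quote-aware splitter (quoted commas and doubled quotes, the cases this README names), and widens each column's type whenever a later row requires it. This is not the tool's actual implementation. `ColType`, `cell_type`, `split_record`, and `infer_types` are hypothetical names, the type lattice is an assumption, and quoted fields that span several lines are not handled.

```rust
use std::io::{self, BufRead};

/// Candidate column types, ordered from most to least specific.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ColType {
    Empty,
    Integer,
    Float,
    Text,
}

/// Classify a single cell. Note that `f64` parsing also accepts "nan" and "inf".
fn cell_type(cell: &str) -> ColType {
    let c = cell.trim();
    if c.is_empty() {
        ColType::Empty
    } else if c.parse::<i64>().is_ok() {
        ColType::Integer
    } else if c.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Text
    }
}

/// Split one record into fields, honouring double-quoted fields.
fn split_record(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut field = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                field.push('"'); // "" inside a quoted field is a literal quote
                chars.next();
            }
            '"' => in_quotes = !in_quotes,
            ',' if !in_quotes => fields.push(std::mem::take(&mut field)),
            _ => field.push(c),
        }
    }
    fields.push(field);
    fields
}

/// Stream the input row by row, keeping only one type per column in memory.
fn infer_types<R: BufRead>(reader: R) -> io::Result<Vec<ColType>> {
    let mut types: Vec<ColType> = Vec::new();
    for line in reader.lines().skip(1) {
        // skip(1) drops the header row
        let line = line?;
        for (i, cell) in split_record(&line).iter().enumerate() {
            if i >= types.len() {
                types.resize(i + 1, ColType::Empty);
            }
            types[i] = types[i].max(cell_type(cell));
        }
    }
    Ok(types)
}

fn main() -> io::Result<()> {
    // Only the last row reveals that `value` is a float, not an integer.
    let csv = "id,value,note\n1,10,\"hello, world\"\n2,20,plain\n3,30.5,\n";
    println!("{:?}", infer_types(csv.as_bytes())?);
    Ok(())
}
```

A sampler that looked only at the first two data rows would have typed the `value` column as an integer; the streaming pass corrects that on the third row without holding the file in memory.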
CSV parsing properly handles:
Quoted fields containing commas ("hello, world")
Escaped quotes ("" for literal ")
Validation goes beyond checking whether files exist and inspects their actual content.
When a MANIFEST.txt exists, the tool validates the dataset against it.
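The exact manifest checks are not listed here, so the following is only a sketch of what such a comparison could look like. It assumes a `sha256sum`-style layout (hex digest, whitespace, relative path), which may differ from the real MANIFEST.txt format, and it takes the recomputed digests as input rather than hashing files itself. `Finding`, `parse_manifest`, and `compare` are hypothetical names, and the digests are shortened placeholders.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Finding {
    Missing(String),  // listed in the manifest, absent on disk
    Mismatch(String), // present, but the digest differs
    Unlisted(String), // on disk, but not in the manifest
}

/// Parse manifest text into a map of path -> digest.
/// ASSUMPTION: each line is "<hex digest><whitespace><relative path>".
fn parse_manifest(text: &str) -> BTreeMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let (digest, path) = line.trim().split_once(char::is_whitespace)?;
            Some((path.trim().to_string(), digest.to_lowercase()))
        })
        .collect()
}

/// Compare manifest entries with digests recomputed from the files on disk.
fn compare(
    manifest: &BTreeMap<String, String>,
    actual: &BTreeMap<String, String>,
) -> Vec<Finding> {
    let mut out = Vec::new();
    for (path, expected) in manifest {
        match actual.get(path) {
            None => out.push(Finding::Missing(path.clone())),
            Some(digest) if digest != expected => out.push(Finding::Mismatch(path.clone())),
            Some(_) => {}
        }
    }
    for path in actual.keys() {
        if !manifest.contains_key(path) {
            out.push(Finding::Unlisted(path.clone()));
        }
    }
    out
}

fn main() {
    let manifest = parse_manifest("aa11  data.csv\nbb22  notes.txt\n");
    // Stand-in for digests recomputed from disk.
    let mut actual = BTreeMap::new();
    actual.insert("data.csv".to_string(), "aa11".to_string());
    actual.insert("notes.txt".to_string(), "ffff".to_string());
    for finding in compare(&manifest, &actual) {
        println!("{:?}", finding);
    }
}
```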
Generates templates with TODO markers for completion:
README.md - Dataset overview with auto-populated file statistics
metadata.json - Structured metadata following data catalog standards
DATACARD.md - Provenance documentation template
MANIFEST.txt - SHA-256 checksums for all files
*.schema.json - Inferred structure for CSV files (based on full-file analysis)
#![forbid(unsafe_code)]
Genesis Preflight can be operationalized at DOE scale through several deployment strategies.
The tool's zero-dependency architecture, read-only operations, and comprehensive output formats make it suitable for integration into diverse DOE computational and data management environments while maintaining the security requirements of sensitive research infrastructure.