datasetq

Crates.io: datasetq
lib.rs: datasetq
version: 0.1.3
created_at: 2025-12-15 18:41:19.013329+00
updated_at: 2025-12-15 18:41:19.013329+00
description: A data processing tool with a jq-like syntax for structured data formats, including CSV, JSON, Parquet, Avro, and more.
homepage: https://datasetq.com
repository: https://github.com/durableprogramming/dsq
max_upload_size:
id: 1986522
size: 448,905
David J Berube (djberube)

documentation: https://docs.rs/dsq

README

dsq

CI · Crates.io · Documentation · Rust 1.69+

dsq (pronounced "disk") is a high-performance data processing tool that extends jq-like syntax to work with structured data formats including Parquet, Avro, CSV, JSON Lines, Arrow, and more. Built on Polars, dsq provides fast data manipulation across multiple file formats with familiar filter syntax.

Key Features

  • Format Flexibility - Process Parquet, Avro, CSV, TSV, JSON Lines, Arrow, and more with automatic format detection
  • Performance - Built on Polars DataFrames with lazy evaluation, columnar operations, and efficient memory usage
  • Familiar Syntax - jq-inspired filter syntax extended to tabular data operations
  • Correctness - Proper type handling and clear error messages

Installation

Binaries

Download binaries for Linux, macOS, and Windows from the releases page.

On Linux:

curl -fsSL https://github.com/datasetq/datasetq/releases/latest/download/dsq-$(uname -m)-unknown-linux-musl -o dsq && chmod +x dsq

From Source

Install with the Rust toolchain (see https://rustup.rs/):

cargo install --locked dsq
cargo install --locked --git https://github.com/datasetq/datasetq  # development version

Or build from the repository:

cargo build --release  # creates target/release/dsq
cargo install --locked --path dsq  # installs binary

Quick Start

Process CSV data:

dsq 'map(select(.age > 30))' people.csv
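
For example, given a hypothetical people.csv like the one sketched below, the filter keeps only the rows whose age column exceeds 30:

# people.csv (hypothetical sample)
# name,age,city
# Alice,34,Berlin
# Bob,28,Lisbon
dsq 'map(select(.age > 30))' people.csv   # keeps only the Alice row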

Convert between formats:

dsq '.' data.csv --output data.parquet

Aggregate data:

dsq 'group_by(.department) | map({dept: .[0].department, count: length})' employees.parquet

Filter and transform:

dsq 'map(select(.status == "active") | {name, email})' users.json

Process multiple files:

dsq 'flatten | group_by(.category)' sales_*.csv

Use lazy evaluation for large datasets:

dsq --lazy 'filter(.amount > 1000)' transactions.parquet

Interactive Mode

Start an interactive REPL to experiment with filters:

dsq --interactive

Available REPL commands:

  • load <file> - Load data from a file
  • show - Display current data
  • explain <filter> - Explain what a filter does
  • history - Show command history
  • help - Show help message
  • quit - Exit
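
A hypothetical session might look like this (the prompt and exact output are illustrative, not taken from this README):

$ dsq --interactive
> load people.csv                    # read a file into the session
> show                               # display the current data
> explain map(select(.age > 30))     # describe what the filter does
> quit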

Common Operations

Format Conversion

dsq convert input.csv output.parquet

Data Inspection

dsq inspect data.parquet --schema --sample 10 --stats

File Merging

dsq merge data1.csv data2.csv --output combined.csv

Shell Completions

dsq completions bash >> ~/.bashrc

Supported Formats

Input/Output:

  • CSV/TSV - Delimited text with customizable options
  • Parquet - Columnar storage with compression
  • JSON/JSON Lines - Standard and newline-delimited JSON
  • Arrow - Columnar in-memory format
  • Avro - Row-based serialization
  • ADT - ASCII Delimited Text (uses ASCII control characters as field and record separators)

Output Only:

  • Excel (.xlsx)
  • ORC - Optimized row columnar

Format detection is automatic based on file extensions. Override with --input-format and --output-format.
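
For instance, a file whose extension gives no hint can still be processed by naming the formats explicitly (the exact format identifiers, csv and json below, are assumed spellings rather than confirmed by this README):

dsq --input-format csv --output-format json '.' export.dat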

Documentation

Command-Line Options

Input/Output

  • -i, --input-format <FORMAT> - Specify input format
  • -o, --output <FILE> - Output file (stdout by default)
  • --output-format <FORMAT> - Specify output format
  • -f, --filter-file <FILE> - Read filter from file
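
A combined sketch (filters.jq is a hypothetical filter file):

dsq -f filters.jq people.csv -o people.parquet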

Processing

  • --lazy - Enable lazy evaluation
  • --dataframe-optimizations - Enable DataFrame optimizations
  • --threads <N> - Number of threads
  • --memory-limit <LIMIT> - Memory limit (e.g., 1GB)
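
A sketch combining the processing flags on a large input:

dsq --lazy --threads 8 --memory-limit 1GB 'filter(.amount > 1000)' transactions.parquet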

Output Formatting

  • -c, --compact-output - Compact output
  • -r, --raw-output - Raw strings without quotes
  • -S, --sort-keys - Sort object keys
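
For example, using users.json from Quick Start:

dsq -c -S 'map({name, email})' users.json   # compact output, keys sorted
dsq -r 'map(.name)' users.json              # raw strings without quotes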

Debugging

  • -v, --verbose - Increase verbosity
  • --explain - Show execution plan
  • --stats - Show execution statistics
  • -I, --interactive - Start REPL mode
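
A sketch that prints the execution plan and runtime statistics for a grouping filter:

dsq --explain --stats 'group_by(.department)' employees.parquet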

Configuration

Configuration files are searched in:

  1. Current directory (.dsq.toml, dsq.yaml)
  2. Home directory (~/.config/dsq/)
  3. System directory (/etc/dsq/)

Manage configuration:

dsq config show                  # Show current configuration
dsq config set filter.lazy_evaluation true
dsq config init                  # Create default config
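
A minimal .dsq.toml sketch, assuming the dotted key filter.lazy_evaluation above maps onto a [filter] table (the real schema is not documented in this README):

# .dsq.toml (hypothetical)
[filter]
lazy_evaluation = true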

See Configuration for details.

Contributing

Contributions are welcome! Please ensure:

  1. Compatibility with jq syntax where possible
  2. Tests pass with cargo test
  3. Documentation updated for new features
  4. Performance implications considered

See CONTRIBUTING.md for details.

Acknowledgements

dsq builds on excellent foundations from:

  • jq - The original and inimitable jq
  • jaq - jq clone inspiring our syntax compatibility
  • Polars - High-performance DataFrame library
  • Arrow - Columnar memory format

Special thanks to Ronald Duncan for defining the ASCII Delimited Text (ADT) format.

Our GitHub Actions disk space cleanup script was inspired by the Apache Flink project.

License

See LICENSE file for details.
