| Crates.io | datasetq |
| lib.rs | datasetq |
| version | 0.1.3 |
| created_at | 2025-12-15 18:41:19.013329+00 |
| updated_at | 2025-12-15 18:41:19.013329+00 |
| description | A data processing tool with a jq-like syntax for structured data formats, including CSV, JSON, Parquet, Avro, and more. |
| homepage | https://datasetq.com |
| repository | https://github.com/durableprogramming/dsq |
| max_upload_size | |
| id | 1986522 |
| size | 448,905 |
dsq (pronounced "disk") is a high-performance data processing tool that extends jq-like syntax to work with structured data formats including Parquet, Avro, CSV, JSON Lines, Arrow, and more. Built on Polars, dsq provides fast data manipulation across multiple file formats with familiar filter syntax.
Download binaries for Linux, Mac, and Windows from the releases page.
On Linux:
curl -fsSL https://github.com/datasetq/datasetq/releases/latest/download/dsq-$(uname -m)-unknown-linux-musl -o dsq && chmod +x dsq
Install with Rust toolchain (see https://rustup.rs/):
cargo install --locked dsq
cargo install --locked --git https://github.com/datasetq/datasetq # development version
Or build from the repository:
cargo build --release # creates target/release/dsq
cargo install --locked --path dsq # installs binary
Process CSV data:
dsq 'map(select(.age > 30))' people.csv
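To illustrate what this filter computes, here is a minimal Python sketch of jq-style `map(select(...))` semantics over CSV rows. The sample data is hypothetical and the row handling is illustrative, not dsq's implementation:

```python
import csv
import io

# A small CSV sample standing in for people.csv (hypothetical data).
raw = "name,age\nAda,36\nBob,28\nCleo,41\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# map(select(.age > 30)): keep only the rows where the predicate holds.
over_30 = [r for r in rows if int(r["age"]) > 30]
print([r["name"] for r in over_30])  # ['Ada', 'Cleo']
```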
Convert between formats:
dsq '.' data.csv --output data.parquet
Aggregate data:
dsq 'group_by(.department) | map({dept: .[0].department, count: length})' employees.parquet
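In jq semantics, `group_by` sorts by the key and collects runs of equal keys into arrays; the `map` then reshapes each group. A rough Python analogy (hypothetical records, not dsq's Polars-based implementation):

```python
from itertools import groupby

# Hypothetical employee records standing in for employees.parquet.
employees = [
    {"name": "Ada", "department": "Eng"},
    {"name": "Bob", "department": "Sales"},
    {"name": "Cleo", "department": "Eng"},
]

# group_by(.department): sort by the key, then collect equal-key runs.
key = lambda e: e["department"]
groups = [list(g) for _, g in groupby(sorted(employees, key=key), key=key)]

# map({dept: .[0].department, count: length}): one summary object per group.
summary = [{"dept": g[0]["department"], "count": len(g)} for g in groups]
print(summary)  # [{'dept': 'Eng', 'count': 2}, {'dept': 'Sales', 'count': 1}]
```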
Filter and transform:
dsq 'map(select(.status == "active") | {name, email})' users.json
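The `{name, email}` part is jq's object-construction shorthand: it projects just those two fields from each selected row. A Python sketch of the combined filter-and-project step (sample users are hypothetical):

```python
# Hypothetical records standing in for users.json.
users = [
    {"name": "Ada", "email": "ada@example.com", "status": "active"},
    {"name": "Bob", "email": "bob@example.com", "status": "inactive"},
]

# select(.status == "active") filters; {name, email} keeps only those fields.
active = [{"name": u["name"], "email": u["email"]}
          for u in users if u["status"] == "active"]
print(active)  # [{'name': 'Ada', 'email': 'ada@example.com'}]
```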
Process multiple files:
dsq 'flatten | group_by(.category)' sales_*.csv
Use lazy evaluation for large datasets:
dsq --lazy 'filter(.amount > 1000)' transactions.parquet
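Lazy evaluation defers work until results are actually consumed, which lets large datasets stream through a filter instead of being materialized up front. A rough generator-based Python analogy (dsq itself uses Polars' lazy engine; this only illustrates the idea):

```python
def scan(rows):
    # Yield rows one at a time instead of materializing the whole dataset.
    for r in rows:
        yield r

# Hypothetical transactions standing in for transactions.parquet.
transactions = [{"id": i, "amount": a} for i, a in enumerate([50, 2500, 900, 1200])]

# Nothing is computed yet: this line only composes a pipeline.
pipeline = (t for t in scan(transactions) if t["amount"] > 1000)

# Work happens only when the result is consumed.
print([t["id"] for t in pipeline])  # [1, 3]
```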
Start an interactive REPL to experiment with filters:
dsq --interactive
Available REPL commands:
load <file> - Load data from a file
show - Display current data
explain <filter> - Explain what a filter does
history - Show command history
help - Show help message
quit - Exit
dsq convert input.csv output.parquet

dsq inspect data.parquet --schema --sample 10 --stats
dsq merge data1.csv data2.csv --output combined.csv
dsq completions bash >> ~/.bashrc
Input/Output:
Output Only:
Format detection is automatic based on file extensions. Override with --input-format and --output-format.
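A sketch of how extension-based detection with an explicit override typically works. The extension-to-format mapping below is assumed for illustration; the actual table lives inside dsq and may differ:

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from file extension to format name (illustrative only).
FORMATS = {
    ".csv": "csv", ".json": "json", ".jsonl": "jsonlines",
    ".parquet": "parquet", ".avro": "avro", ".arrow": "arrow",
}

def detect_format(path: str, override: Optional[str] = None) -> str:
    # An explicit --input-format / --output-format value wins over the extension.
    if override:
        return override
    return FORMATS.get(Path(path).suffix.lower(), "json")

print(detect_format("data.parquet"))        # parquet
print(detect_format("export_dump", "csv"))  # csv
```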
-i, --input-format <FORMAT> - Specify input format
-o, --output <FILE> - Output file (stdout by default)
--output-format <FORMAT> - Specify output format
-f, --filter-file <FILE> - Read filter from file
--lazy - Enable lazy evaluation
--dataframe-optimizations - Enable DataFrame optimizations
--threads <N> - Number of threads
--memory-limit <LIMIT> - Memory limit (e.g., 1GB)
-c, --compact-output - Compact output
-r, --raw-output - Raw strings without quotes
-S, --sort-keys - Sort object keys
-v, --verbose - Increase verbosity
--explain - Show execution plan
--stats - Show execution statistics
-I, --interactive - Start REPL mode
Configuration files are searched in:
The current directory (.dsq.toml, dsq.yaml), the user config directory (~/.config/dsq/), and the system directory (/etc/dsq/).
Manage configuration:
dsq config show # Show current configuration
dsq config set filter.lazy_evaluation true
dsq config init # Create default config
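Inferred from the `dsq config set filter.lazy_evaluation true` example above, a config file might look like the following sketch; the table and key structure beyond that one setting are assumptions, not documented dsq options:

```toml
# .dsq.toml - illustrative sketch; only filter.lazy_evaluation is taken
# from the example above, everything else about the layout is assumed.
[filter]
lazy_evaluation = true
```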
See Configuration for details.
Contributions are welcome! Please ensure:
All tests pass (cargo test).
See CONTRIBUTING.md for details.
dsq builds on excellent foundations from:
Special thanks to Ronald Duncan for defining the ASCII Delimited Text (ADT) format.
Our GitHub Actions disk space cleanup script was inspired by the Apache Flink project.
See LICENSE file for details.