| Crates.io | dply |
| lib.rs | dply |
| version | 0.3.5 |
| created_at | 2023-05-08 09:40:29.844557+00 |
| updated_at | 2025-03-06 13:05:33.573581+00 |
| description | A command line data manipulation tool inspired by the dplyr grammar. |
| homepage | |
| repository | https://github.com/vincev/dply-rs |
| max_upload_size | |
| id | 859644 |
| size | 683,780 |
dply is a command line tool for viewing, querying, and writing csv and parquet files, inspired by dplyr.
A dply pipeline consists of a number of functions to read, transform, or write Parquet or CSV files.
The functions csv, json and parquet read and write data for their respective
formats. The following two steps pipeline converts a Parquet file to NdJSON:
$ dply -c 'parquet("nyctaxi.parquet") | json("nyctaxi.json")'
We can use a select step if we want to convert a subset of the columns:
$ dply -c 'parquet("nyctaxi.parquet") |
select(ends_with("time"), payment_type) |
json("nyctaxi.json")'
$ head -2 nyctaxi.json| jq
{
"payment_type": "Credit card",
"tpep_dropoff_datetime": "2022-11-22T19:45:53",
"tpep_pickup_datetime": "2022-11-22T19:27:01"
}
{
"payment_type": "Cash",
"tpep_dropoff_datetime": "2022-11-27T16:50:06",
"tpep_pickup_datetime": "2022-11-27T16:43:26"
}
To extract a nested field in a NdJSON file we can use the field function in a
mutate step. The following example extracts the sha from the list of
commits in the payload object:
$ dply -c 'json("./tests/data/github.json") |
mutate(commits = field(payload, commits)) |
unnest(commits) |
mutate(sha = field(commits, sha)) |
select(sha) |
show()'
shape: (4, 1)
┌──────────────────────────────────────────┐
│ sha │
│ --- │
│ str │
╞══════════════════════════════════════════╡
│ a02be18dc2a0faa0faec14f50c8b190ca0b50034 │
│ ac97a4ab3a4d86f61a6ba167c06cd8813b470867 │
│ null │
│ e4b233f1323a4b4e4461ed1aad31d20a7fbf0db4 │
└──────────────────────────────────────────┘
Complex NdJSON files can be converted to Parquet for faster query processing:
$ dply -c 'json("github.json") | parquet("github.parquet")'
The following pipeline reads a Parquet file1, group rows by payment_type,
computes the minimum, mean, and maximum fare for each payment type, saves the
result to fares.csv CSV file, and shows the result:
$ dply -c 'parquet("nyctaxi.parquet") |
group_by(payment_type) |
summarize(
min_price = min(total_amount),
mean_price = mean(total_amount),
max_price = max(total_amount)
) |
arrange(payment_type) |
csv("fares.csv") |
show()'
shape: (5, 4)
┌──────────────┬───────────┬────────────┬───────────┐
│ payment_type ┆ min_price ┆ mean_price ┆ max_price │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞══════════════╪═══════════╪════════════╪═══════════╡
│ Cash ┆ -61.85 ┆ 18.07 ┆ 86.55 │
│ Credit card ┆ 4.56 ┆ 22.969491 ┆ 324.72 │
│ Dispute ┆ -55.6 ┆ -0.145161 ┆ 54.05 │
│ No charge ┆ -16.3 ┆ 0.086667 ┆ 19.8 │
│ Unknown ┆ 9.96 ┆ 28.893333 ┆ 85.02 │
└──────────────┴───────────┴────────────┴───────────┘
Running dply without any parameter starts the interactive client:
dply supports the following functions:
more examples can be found in the tests folder.
Binaries generated by the release Github action for Linux, macOS (x86), and Windows are available in the releases page.
You can also install dply using Cargo:
cargo install dply
or by building it from this repository:
git clone https://github.com/vincev/dply-rs
cd dply-rs
cargo install --path .
The file nyctaxi.parquet in the tests/data folder is a
250 rows parquet file sampled from the NYC trip record data. ↩