Crates.io | csv2parquet |
lib.rs | csv2parquet |
source | src |
created_at | 2021-03-05 17:00:58.660321+00 |
updated_at | 2025-02-16 15:13:12.486406+00 |
description | Convert CSV files to Parquet |
homepage | https://github.com/domoritz/arrow-tools/tree/main/crates/csv2parquet |
repository | https://github.com/domoritz/arrow-tools |
Convert CSV files to Apache Parquet. This package is part of Arrow CLI tools.
You can get the latest releases from https://github.com/domoritz/arrow-tools/releases.
brew install domoritz/homebrew-tap/csv2parquet
cargo install csv2parquet
To avoid re-compilation and speed up installation, you can install this tool with cargo binstall:
cargo binstall csv2parquet
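Whichever installation method you use, a quick way to confirm the binary is on your PATH is to print its version (the -V/--version flag is listed in the help output below):
csv2parquet --version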
Usage: csv2parquet [OPTIONS] <CSV> <PARQUET>
Arguments:
<CSV>
Input CSV file, stdin if not present
<PARQUET>
Output file
Options:
-s, --schema-file <SCHEMA_FILE>
File with Arrow schema in JSON format
--max-read-records <MAX_READ_RECORDS>
The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
--header <HEADER>
Set whether the CSV file has headers
[default: true]
[possible values: true, false]
--delimiter <DELIMITER>
Set the CSV file's column delimiter as a byte character
--escape <ESCAPE>
Specify an escape character
--quote <QUOTE>
Specify a custom quote character
--comment <COMMENT>
Specify a comment character.
Lines starting with this character will be ignored
--null-regex <NULL_REGEX>
Provide a regex to match null values
-c, --compression <COMPRESSION>
Set the compression
[possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd, lz4-raw]
-e, --encoding <ENCODING>
Sets encoding for any column
[possible values: plain, plain-dictionary, rle, rle-dictionary, delta-binary-packed, delta-length-byte-array, delta-byte-array, byte-stream-split]
--data-page-size-limit <DATA_PAGE_SIZE_LIMIT>
Sets data page size limit
--dictionary-page-size-limit <DICTIONARY_PAGE_SIZE_LIMIT>
Sets dictionary page size limit
--write-batch-size <WRITE_BATCH_SIZE>
Sets write batch size
--max-row-group-size <MAX_ROW_GROUP_SIZE>
Sets max size for a row group
--created-by <CREATED_BY>
Sets "created by" property
--dictionary <DICTIONARY>
Sets flag to enable/disable dictionary encoding for any column
[possible values: true, false]
--statistics <STATISTICS>
Sets flag to enable/disable statistics for any column
[possible values: none, chunk, page]
-p, --print-schema
Print the schema to stderr
-n, --dry
Only print the schema
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
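As a sketch of how several of these options combine in one invocation (the file names and option values here are illustrative, not taken from the project's documentation):
csv2parquet --delimiter ';' --null-regex '^(NULL|N/A)?$' --compression zstd --max-row-group-size 100000 data.csv data.parquet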
The --schema-file option uses the same file format as --dry and --print-schema.
Example: Convert a CSV to Parquet
csv2parquet data.csv data.parquet

Example: Convert a CSV without header to Parquet
csv2parquet --header false <CSV> <PARQUET>

Example: Get the schema from a CSV with header
csv2parquet --header true --dry <CSV> <PARQUET>

Example: Convert a CSV with a schema-file to Parquet
Below is an example of the schema-file content:
{
  "fields": [
    {
      "name": "col1",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": " col2",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
Then pass the schema file schema.json on the command line:
csv2parquet --header false --schema-file schema.json <CSV> <PARQUET>
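If you want the columns to come out with non-string types, you can change the data_type values in the schema file. A hedged sketch (the field names and types here are illustrative; unit-variant Arrow type names such as Int64 and Float64 are written as plain strings, in the same way Utf8 appears in the example above):
{
  "fields": [
    {
      "name": "id",
      "data_type": "Int64",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "price",
      "data_type": "Float64",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}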
Example: Stream a CSV from a URL to S3
This technique avoids writing large intermediate files to disk. For example, here we stream a CSV file from a URL, convert it on the fly, and upload the resulting Parquet to S3.
curl <FILE_URL> | csv2parquet /dev/stdin /dev/stdout | aws s3 cp - <S3_DESTINATION>
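The same pattern works with any pipeline that produces CSV on stdout; for instance (a sketch, with data.csv and the output name purely illustrative):
cat data.csv | csv2parquet /dev/stdin data.parquet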