# CSV to Parquet [![Crates.io](https://img.shields.io/crates/v/csv2parquet.svg)](https://crates.io/crates/csv2parquet)

Convert CSV files to [Apache Parquet](https://parquet.apache.org/). This package is part of [Arrow CLI tools](https://github.com/domoritz/arrow-tools).

## Installation

### Download prebuilt binaries

You can get the latest releases from https://github.com/domoritz/arrow-tools/releases.

### With Homebrew

```
brew install domoritz/homebrew-tap/csv2parquet
```

### With Cargo

```
cargo install csv2parquet
```

### With [Cargo B(inary)Install](https://github.com/cargo-bins/cargo-binstall)

To avoid re-compilation and speed up installation, you can install this tool with `cargo binstall`:

```
cargo binstall csv2parquet
```

## Usage

```
Usage: csv2parquet [OPTIONS] [CSV] [PARQUET]

Arguments:
  [CSV]      Input CSV file, stdin if not present
  [PARQUET]  Output file

Options:
  -s, --schema-file <SCHEMA_FILE>
          File with Arrow schema in JSON format
      --max-read-records <MAX_READ_RECORDS>
          The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed
      --header <HEADER>
          Set whether the CSV file has headers [possible values: true, false]
  -d, --delimiter <DELIMITER>
          Set the CSV file's column delimiter as a byte character [default: ,]
  -c, --compression <COMPRESSION>
          Set the compression [possible values: uncompressed, snappy, gzip, lzo, brotli, lz4, zstd, lz4-raw]
  -e, --encoding <ENCODING>
          Sets encoding for any column [possible values: plain, plain-dictionary, rle, rle-dictionary, delta-binary-packed, delta-length-byte-array, delta-byte-array, byte-stream-split]
      --data-page-size-limit <DATA_PAGE_SIZE_LIMIT>
          Sets data page size limit
      --dictionary-page-size-limit <DICTIONARY_PAGE_SIZE_LIMIT>
          Sets dictionary page size limit
      --write-batch-size <WRITE_BATCH_SIZE>
          Sets write batch size
      --max-row-group-size <MAX_ROW_GROUP_SIZE>
          Sets max size for a row group
      --created-by <CREATED_BY>
          Sets "created by" property
      --dictionary
          Sets flag to enable/disable dictionary encoding for any column
      --statistics <STATISTICS>
          Sets flag to enable/disable statistics for any column [possible values: none, chunk, page]
      --max-statistics-size <MAX_STATISTICS_SIZE>
          Sets max statistics size for any column. Applicable only if statistics are enabled
  -p, --print-schema
          Print the schema to stderr
  -n, --dry
          Only print the schema
  -h, --help
          Print help
  -V, --version
          Print version
```

The `--schema-file` option uses the same file format as `--dry` and `--print-schema`.
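Several of these options can be combined in a single invocation. A minimal sketch, using the flags documented above (`data.csv` and `data.parquet` are placeholder file names):

```bash
# Infer the schema from the first 1000 rows, read ';'-delimited input,
# and write zstd-compressed Parquet
csv2parquet --max-read-records 1000 --delimiter ';' --compression zstd data.csv data.parquet
```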
## Examples

### Convert a CSV to Parquet

```bash
csv2parquet data.csv data.parquet
```

### Convert a CSV with no `header` to Parquet

```bash
csv2parquet --header false data.csv data.parquet
```

### Get the `schema` from a CSV with a header

```bash
csv2parquet --header true --dry data.csv data.parquet
```

### Convert a CSV using `schema-file` to Parquet

Below is an example of the `schema-file` content:

```json
{
  "fields": [
    {
      "name": "col1",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    },
    {
      "name": "col2",
      "data_type": "Utf8",
      "nullable": false,
      "dict_id": 0,
      "dict_is_ordered": false,
      "metadata": {}
    }
  ],
  "metadata": {}
}
```

Then pass the schema file `schema.json` on the command line:

```
csv2parquet --header false --schema-file schema.json data.csv data.parquet
```

### Convert streams, piping from standard input to standard output

This avoids writing large intermediate files to disk. For example, here we stream a CSV file from a URL to S3:

```bash
curl <URL of CSV file> | csv2parquet /dev/stdin /dev/stdout | aws s3 cp - <S3 destination>
```
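The same pattern works for purely local pipelines. As a sketch (assuming a gzipped input file, here named `data.csv.gz` for illustration), you can convert a compressed CSV without first unpacking it to disk:

```bash
# Decompress to stdout and convert the stream directly to Parquet
gunzip -c data.csv.gz | csv2parquet /dev/stdin data.parquet
```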