tpchgen-cli

Crates.iotpchgen-cli
lib.rstpchgen-cli
version2.0.1
created_at2025-03-30 22:39:26.975231+00
updated_at2025-09-08 18:24:32.613966+00
descriptionBlazing fast pure Rust TPC-H data generator command line tool.
homepagehttps://github.com/clflushopt/tpchgen-rs
repositoryhttps://github.com/clflushopt/tpchgen-rs
max_upload_size
id1612842
size159,221
@clflushopt (clflushopt)

documentation

README

TPC-H Data Generator CLI

tpchgen-cli is a high-performance, parallel TPC-H data generator command line tool

This tool is more than 10x faster than the next fastest TPCH generator we know of (duckdb). On a 2023 Mac M3 Max laptop, it easily generates data faster than can be written to SSD. See BENCHMARKS.md for more details on performance and benchmarking.

  • See the tpchgen README.md for project details
  • Watch this awesome demo by @alamb to see tpchgen-cli in action
  • Read the companion blog post in the Datafusion blog to learn about the project's history
  • Try it yourself by following the instructions below

Install via pip

pip install tpchgen-cli

Install via Rust

Install Rust and compile

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
RUSTFLAGS='-C target-cpu=native' cargo install tpchgen-cli

Examples

# Scale Factor 10, all tables, in Apache Parquet format in the current directory
# (3.6GB, 8 files, 60M lineitem rows, in 5 seconds on a modern laptop)
tpchgen-cli -s 10 --format=parquet

# Scale Factor 10, all tables, in `tbl`(csv like) format in the `sf10` directory
# (10GB, 8 files, 60M lineitem rows)
tpchgen-cli -s 10 --output-dir sf10

# Scale Factor 1000, lineitem table, in Apache Parquet format in sf1000 directory, 
# 20 part(ititons), 100MB row groups
# (220GB, 20 files, 6B lineitem rows, 3.5 minutes on a modern laptop)
tpchgen-cli -s 1000 --tables lineitem --parts 20 --format=parquet --parquet-row-group-bytes=100000000 --output-dir sf1000

# Scale Factor 10, partition 2 and 3 of 10 in sf10 directory
#
# partitioned/
# ├── lineitem
# │   ├── lineitem.2.tbl
# │   └── lineitem.3.tbl
# └── orders
#    ├── orders.2.tbl
#    └── orders.3.tbl
#     
for PART in `seq 2 3`; do
  tpchgen-cli --tables lineitem,orders --scale-factor=10 --output-dir partitioned --parts 10 --part $PART
done

Performance

Scale Factor tpchgen-cli DuckDB DuckDB (proprietary)
1 0:02.24 0:12.29 0:10.68
10 0:09.97 1:46.80 1:41.14
100 1:14.22 17:48.27 16:40.88
1000 10:26.26 N/A (OOM) N/A (OOM)
  • DuckDB (proprietary) is the time required to create TPCH data using the proprietary DuckDB format
  • Creating Scale Factor 1000 data in DuckDB required 647 GB of memory, which is why it is not included in the table above.

Times to create TPCH tables in Parquet format using tpchgen-cli and duckdb for various scale factors.

Commit count: 203

cargo fmt