| Crates.io | pgparquet |
| lib.rs | pgparquet |
| version | 0.1.0 |
| created_at | 2025-07-13 17:29:07.763828+00 |
| updated_at | 2025-07-13 17:29:07.763828+00 |
| description | High-performance CLI tool for streaming Parquet files from Google Cloud Storage into PostgreSQL |
| homepage | |
| repository | https://github.com/JacobHayes/pgparquet |
| max_upload_size | |
| id | 1750652 |
| size | 115,460 |
A quick-and-dirty CLI to read Parquet files from Google Cloud Storage (GCS) and stream them into PostgreSQL. Other storage backends may be added in the future.
The pg_parquet extension is great, but it cannot be installed on hosted PostgreSQL providers (e.g. GCP). DuckDB can read Parquet and write to PostgreSQL, but it doesn't support Google Application Default Credentials (ADC), which makes authentication more challenging.
> [!NOTE]
> This project is a prototype as I learn Rust - there may be bugs or inefficiencies. Feel free to contribute!
```shell
git clone https://github.com/JacobHayes/pgparquet
cd pgparquet
cargo build --release
```
Create a new table and load a single Parquet file:
```shell
pgparquet \
  --path gs://my-bucket/data/single-file.parquet \
  --database-url "postgresql://user:password@localhost:5432/mydb" \
  --table analytics.my_table \
  --create-table
```
Wipe a table and load all parquet files from a folder:
```shell
pgparquet \
  --path gs://my-bucket/data/parquet-files/ \
  --database-url "postgresql://user:password@localhost:5432/mydb" \
  --table analytics.my_table \
  --truncate
```
- `--path`, `-p`: GCS path; end with `.parquet` for a single file or `/` for a folder (e.g. `gs://bucket/file.parquet` or `gs://bucket/folder/`)
- `--database-url`, `-d`: PostgreSQL connection string (required)
- `--table`, `-t`: target table name in PostgreSQL (can include a schema: `schema.table`)
- `--batch-size`: number of records to process in each batch (default: 1000)
- `--create-table`: create the table if it doesn't exist
- `--truncate`: truncate the table before loading data

You can also use environment variables for sensitive information or the log level:
```shell
export DATABASE_URL="postgresql://user:password@localhost:5432/mydb"
export RUST_LOG=info  # set logging level
```
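Conceptually, `--batch-size` controls how many rows are buffered before each write to PostgreSQL. A minimal Python sketch of that chunking (illustrative only; the tool itself is written in Rust):

```python
from typing import Iterable, Iterator

def batches(rows: Iterable, batch_size: int = 1000) -> Iterator[list]:
    """Group rows into lists of at most batch_size, as --batch-size does."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

Each yielded batch would correspond to one round trip to PostgreSQL; larger batches mean fewer round trips but more memory held per insert.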
The tool automatically maps Arrow/Parquet data types to PostgreSQL types:
| Arrow/Parquet Type | PostgreSQL Type |
|---|---|
| Boolean | BOOLEAN |
| Int8, Int16 | SMALLINT |
| Int32 | INTEGER |
| Int64 | BIGINT |
| UInt64 | NUMERIC |
| Float32 | REAL |
| Float64 | DOUBLE PRECISION |
| Utf8, LargeUtf8 | TEXT |
| Binary, LargeBinary | BYTEA |
| Date32, Date64 | DATE |
| Time32, Time64 | TIME |
| Timestamp | TIMESTAMP |
| Decimal128, Decimal256 | NUMERIC |
| List, Struct, Map | JSONB |
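The mapping above can be sketched as a lookup table that generates the kind of DDL `--create-table` implies (an illustrative Python sketch, not the tool's actual Rust implementation; the helper name is hypothetical):

```python
# Illustrative Arrow/Parquet -> PostgreSQL type mapping, per the table above.
ARROW_TO_PG = {
    "Boolean": "BOOLEAN",
    "Int8": "SMALLINT", "Int16": "SMALLINT",
    "Int32": "INTEGER",
    "Int64": "BIGINT",
    "UInt64": "NUMERIC",
    "Float32": "REAL",
    "Float64": "DOUBLE PRECISION",
    "Utf8": "TEXT", "LargeUtf8": "TEXT",
    "Binary": "BYTEA", "LargeBinary": "BYTEA",
    "Date32": "DATE", "Date64": "DATE",
    "Time32": "TIME", "Time64": "TIME",
    "Timestamp": "TIMESTAMP",
    "Decimal128": "NUMERIC", "Decimal256": "NUMERIC",
    "List": "JSONB", "Struct": "JSONB", "Map": "JSONB",
}

def create_table_sql(table: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement from column-name -> Arrow-type pairs."""
    cols = ", ".join(f"{name} {ARROW_TO_PG[t]}" for name, t in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"
```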
For large loads, consider tuning PostgreSQL settings such as `shared_buffers`, `maintenance_work_mem`, and `checkpoint_segments` (replaced by `max_wal_size` in PostgreSQL 9.5+).

Verify your Google Cloud credentials:

```shell
gcloud auth application-default print-access-token
```
Check that your user or service account has the necessary permissions:

- `storage.objects.get`
- `storage.objects.list`

Verify the connection string format:

```
postgresql://[user[:password]@][host][:port][/dbname][?param1=value1&...]
```

Test the connection manually:

```shell
psql "postgresql://user:password@host:5432/dbname"
```
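If the connection still fails, you can sanity-check how the URL decomposes using Python's standard library (a quick debugging aid, unrelated to the tool's own Rust-side parsing):

```python
from urllib.parse import urlparse

# Example URL only; substitute your own connection string.
parts = urlparse("postgresql://user:password@localhost:5432/mydb")
print(parts.username, parts.hostname, parts.port, parts.path.lstrip("/"))
# user localhost 5432 mydb
```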
This project is licensed under the MIT License.