Crates.io | datahobbit |
lib.rs | datahobbit |
version | 1.0.0 |
source | src |
created_at | 2024-11-19 20:07:51.863362 |
updated_at | 2024-11-19 20:07:51.863362 |
description | A tool that generates CSV or Parquet files with synthetic data based on a provided JSON schema |
homepage | |
repository | |
max_upload_size | |
id | 1453822 |
size | 67,590 |
A Rust command-line tool that generates CSV or Parquet files with synthetic data based on a provided JSON schema. It supports custom delimiters for CSV, displays a progress bar during generation, and efficiently handles large datasets using parallel processing.
To build and run the CSV and Parquet Generator, you need to have Rust installed on your system.
Clone the Repository
git clone https://github.com/yourusername/datahobbit.git
cd datahobbit
Build the Project
cargo build --release
This will create an executable in the target/release directory.
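After the release build completes, you can invoke the binary directly from target/release. A minimal example (assuming a schema.json file in the current directory) is:
./target/release/datahobbit schema.json output.csv --records 1000
This is equivalent to the cargo run -- examples shown later in this document.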
Cargo
cargo add datahobbit
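If you prefer to declare the dependency by hand, the equivalent Cargo.toml entry (assuming the crate name and version listed in the metadata above) is:
[dependencies]
datahobbit = "1.0.0"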
Run the executable with the following options:
USAGE:
datahobbit [OPTIONS] <input> <output>
ARGS:
<input> Sets the input JSON schema file
<output> Sets the output file (either .csv or .parquet)
OPTIONS:
-d, --delimiter <DELIMITER> Sets the delimiter to use in the CSV file (default is ',')
-h, --help Print help information
-r, --records <RECORDS> Sets the number of records to generate
--format <FORMAT> Sets the output format: either "csv" or "parquet" (default is "csv")
--max-file-size <MAX_FILE_SIZE> Sets the maximum file size for Parquet files in bytes (default is 512 MB)
-V, --version Print version information
The JSON schema defines the structure of the output file, including column names and data types. Here is an example schema:
{
  "columns": [
    { "name": "id", "type": "integer" },
    { "name": "first_name", "type": "first_name" },
    { "name": "last_name", "type": "last_name" },
    { "name": "email", "type": "email" },
    { "name": "phone_number", "type": "phone_number" },
    { "name": "age", "type": "integer" },
    { "name": "bio", "type": "sentence" },
    { "name": "is_active", "type": "boolean" }
  ]
}
Generate a CSV with Default Settings
cargo run -- schema.json output.csv --records 100000
Generate a Parquet File
cargo run -- schema.json output.parquet --records 100000 --format parquet
Generate a Parquet File with Custom Size Limit
cargo run -- input_schema.json output.parquet --records 1000000 --format parquet --max-file-size 10485760
This generates 1,000,000 records in Parquet format and caps each output file at 10 MB (10,485,760 bytes), creating additional files as needed.
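Because --max-file-size is given in bytes, it can be convenient to let the shell do the arithmetic. For example, a 256 MB limit (an illustrative value, assuming a POSIX-style shell with $(( )) arithmetic) could be written as:
cargo run -- input_schema.json output.parquet --records 1000000 --format parquet --max-file-size $((256 * 1024 * 1024))
Here $((256 * 1024 * 1024)) expands to 268435456 bytes.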
Generate a CSV with a Custom Delimiter
cargo run -- input_schema.json output.csv --records 100000 --delimiter ';'
Uses a semicolon (;) as the delimiter.
Display Help Information
cargo run -- --help
The following data types are supported in the schema:
integer: Generates random integers between 0 and 1000.
float: Generates random floating-point numbers between 0.0 and 1000.0.
string: Generates random words.
boolean: Generates random boolean values (true or false).
name: Generates full names.
first_name: Generates first names.
last_name: Generates last names.
email: Generates email addresses.
password: Generates passwords with lengths between 8 and 16 characters.
sentence: Generates sentences containing 5 to 10 words.
phone_number: Generates phone numbers.
Example Usage in Schema
{ "name": "age", "type": "integer" }
{ "name": "description", "type": "sentence" }
{ "name": "is_verified", "type": "boolean" }
This project is licensed under the MIT License.
Author: Daniel Beach (dancrystalbeach@gmail.com)
Version: 1.0