Crates.io | rhuffle |
lib.rs | rhuffle |
version | 0.3.3 |
source | src |
created_at | 2020-03-25 06:37:36.706877 |
updated_at | 2022-07-14 01:07:16.581307 |
description | Random shuffler for large file with many lines |
homepage | |
repository | https://github.com/ctylim/rhuffle |
max_upload_size | |
id | 222573 |
size | 30,034 |
rhuffle is a random shuffler for large files with many lines, including files that exceed available RAM.
rhuffle supports:
See lib.rs.
USAGE:
    rhuffle [OPTIONS]

FLAGS:
        --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -b, --buf <NUMBER>
            Sets buffer size which is smaller than available RAM with bytes (default: 4294967296).
        --dst <PATH>
            Sets destination file path. If not set, destination sets to stdout. (default: None)
        --feed <LF|LF_CRLF>
            Sets acceptable line feed as EOL (default: LF_CRLF).
    -h, --head <NUMBER>
            Sets first `n` lines without shuffling (default: 0). For multiple input sources, take README a look.
        --log <off|error|warn|info|debug|trace>
            Sets log level. (default: off)
        --src <[PATH]>
            Sets source file paths (space separated). If not set, source sets to stdin. (default: None)
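
The two `--feed` values can be read as two line-splitting rules. The following is a minimal Rust sketch of that reading, not rhuffle's actual code. It assumes that `LF` treats only `\n` as the end of a line (so a preceding `\r` stays in the line), while `LF_CRLF` accepts both `\n` and `\r\n`. The `Feed` enum and `split_lines` function are hypothetical names introduced for illustration.

```rust
/// Hypothetical mirror of the two `--feed` values from the help text.
#[derive(Clone, Copy)]
enum Feed {
    Lf,
    LfCrlf,
}

/// Splits `input` into lines.
/// Assumption: with `Lf` only "\n" ends a line, so a "\r" before it
/// remains part of the line; with `LfCrlf` both "\n" and "\r\n" end a line.
fn split_lines(input: &str, feed: Feed) -> Vec<&str> {
    input
        .split_terminator('\n')
        .map(|line| match feed {
            Feed::Lf => line,
            // Drop one trailing '\r' if present, otherwise keep the line as is.
            Feed::LfCrlf => line.strip_suffix('\r').unwrap_or(line),
        })
        .collect()
}

fn main() {
    let input = "a\r\nb\n";
    println!("{:?}", split_lines(input, Feed::Lf));
    println!("{:?}", split_lines(input, Feed::LfCrlf));
}
```

Under this assumption, a file with mixed line endings yields clean lines only in the `LF_CRLF` mode, which would explain why it is the default.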
`--head n` Option
- The first `n` lines of the first input source are forwarded to the output without shuffling.
- The first `n` lines of every other input source are skipped.

in1.txt
head1-1
head2-1
line1-1
line2-1
in2.txt
head1-2
head2-2
line1-2
line2-2
$ rhuffle --src in1.txt in2.txt --dst out.txt --head 2
out.txt
head1-1 // L1-L2: fixed
head2-1
line2-1 // L3-L6: shuffled globally
line1-2
line2-2
line1-1
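
The `--head` behaviour in the example above can be sketched in Rust as follows. This is an illustrative in-memory model, not rhuffle's implementation, which is designed for files larger than RAM. The function `shuffle_with_head` and the small `XorShift64` generator are hypothetical names, and the generator exists only to keep the sketch free of external crates.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Illustration only; a real tool would use a proper RNG.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical helper that applies the `--head n` rule to in-memory sources.
/// Returns (head lines kept in order, remaining lines shuffled together).
fn shuffle_with_head(sources: &[Vec<&str>], n: usize, seed: u64) -> (Vec<String>, Vec<String>) {
    let mut head = Vec::new();
    let mut body = Vec::new();
    for (i, src) in sources.iter().enumerate() {
        for (j, line) in src.iter().enumerate() {
            if j < n {
                // Only the first source contributes its head lines;
                // head lines of the other sources are skipped.
                if i == 0 {
                    head.push(line.to_string());
                }
            } else {
                body.push(line.to_string());
            }
        }
    }
    // Fisher-Yates shuffle over the lines of all sources.
    // The modulo introduces a slight bias, acceptable for a sketch.
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    for i in (1..body.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        body.swap(i, j);
    }
    (head, body)
}

fn main() {
    let in1 = vec!["head1-1", "head2-1", "line1-1", "line2-1"];
    let in2 = vec!["head1-2", "head2-2", "line1-2", "line2-2"];
    let (head, body) = shuffle_with_head(&[in1, in2], 2, 42);
    for line in head.iter().chain(body.iter()) {
        println!("{}", line);
    }
}
```

With the two sample inputs and `n = 2`, the output starts with `head1-1` and `head2-1`, followed by the four remaining lines in an order that depends on the seed.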
`--feed` Option

The results below focus on execution time in a limited memory space. Two datasets are used for testing.
Three programs are compared:
shuf {src} -o {dst}
terashuf < {src} > {dst}
rhuffle --src {src} --dst {dst}
Benchmarks were run on a MacBook Pro 2017 (Core i7 3.1GHz, 16GB RAM).
Execution time was measured with `time`.
5.3GB size, 55423856 lines
Software | real | user | sys |
---|---|---|---|
GNU shuf | 0m59s | 0m34s | 0m14s |
terashuf | 5m06s | 4m43s | 0m14s |
rhuffle | 1m56s | 1m06s | 0m40s |
9.0GB size, 21550072 lines
Software | real | user | sys |
---|---|---|---|
GNU shuf | x | x | x |
terashuf | 8m12s | 7m16s | 0m31s |
rhuffle | 1m47s | 0m39s | 0m51s |
GNU shuf could not be measured on this dataset because it ran too slowly (shown as `x` in the table).