Crates.io | joinkit |
lib.rs | joinkit |
version | 0.2.0 |
source | src |
created_at | 2015-12-02 14:54:03.286551 |
updated_at | 2016-05-03 17:46:38.538608 |
description | Iterator adaptors for efficient SQL-like joins |
repository | https://github.com/milancio42/joinkit |
id | 3559 |
size | 95,199 |
Iterator adaptors for efficient SQL-like joins. The library is documented here.
To use it, add the following to your Cargo.toml:
[dependencies]
joinkit = "0.2"
and then include the following code in your crate:
extern crate joinkit;
use joinkit::Joinkit;
This crate provides two binaries, hjoin and mjoin, which can be used to join data on the command line using the Hash Join and Merge Join strategies respectively.
See the documentation to learn more about the join strategies.
You can also run hjoin --help or mjoin --help to learn about their usage.
Prepare test data:
data_path=/tmp/join
if ! [[ -d $data_path ]]; then mkdir -p $data_path; fi
cd $data_path
gawk 'BEGIN{n=20;for(i=0;i<n;i++){print i ",L"}}' > left-num-20
gawk 'BEGIN{n=20;for(i=(n/2);i<(n+n/2);i++){print i ",R"}}' > right-num-20
gawk 'BEGIN{n=20;for(i=0;i<n;i++){print i ",L"}}' | sort -t , -k 1,1 > left-char-20
gawk 'BEGIN{n=20;for(i=(n/2);i<(n+n/2);i++){print i ",R"}}' | sort -t , -k 1,1 > right-char-20
gawk 'BEGIN{n=1000000;for(i=0;i<n;i++){print i ",L"}}' > left-num-1M
gawk 'BEGIN{n=1000000;for(i=(n/2);i<(n+n/2);i++){print i ",R"}}' > right-num-1M
gawk 'BEGIN{n=1000000;for(i=0;i<n;i++){print i ",L"}}' | sort -t , -k 1,1 > left-char-1M
gawk 'BEGIN{n=1000000;for(i=(n/2);i<(n+n/2);i++){print i ",R"}}' | sort -t , -k 1,1 > right-char-1M
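The commands above produce two variants of each file: the *-num-* files keep the keys in numeric order, while the *-char-* files are sorted as text. The difference can be sketched as follows. This is only an illustration: sort_as_text is a hypothetical helper, and it assumes that sort compares the key field byte by byte.

```rust
// Hypothetical helper: order keys the way a byte-wise text sort would
// (assumed to match `sort -t , -k 1,1` on these digit-only keys).
fn sort_as_text(keys: &[u64]) -> Vec<String> {
    let mut v: Vec<String> = keys.iter().map(|k| k.to_string()).collect();
    v.sort(); // String comparison is lexicographic, character by character
    v
}

fn main() {
    let keys: Vec<u64> = (8..12).collect(); // 8, 9, 10, 11

    // Numeric order is simply the generation order.
    println!("numeric: {:?}", keys);

    // Text order puts "10" and "11" before "8" because '1' < '8'.
    println!("text:    {:?}", sort_as_text(&keys));
}
```

A join that relies on sorted input has to compare keys with the same ordering that was used to sort the files, which is why both variants are prepared.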
Clone the repository and build the binaries:
cd ~/some/local/path
git clone https://github.com/milancio42/joinkit.git
cd joinkit
cargo build --release
cd target/release
Inner join: the output contains only the rows whose key is present in both input files.
In the examples below, the join key is the first column in both files (-1 1 for the left file, -2 1 for the right file). When a key is composed of several columns, their order is important.
Note that hjoin loads the right input data into a HashMap.
./hjoin -1 1 -2 1 $data_path/left-char-20 $data_path/right-char-20
# To join on numeric data, append '-u' to the field number to convert the key from a string to 'u64' (or '-i' for 'i64')
./hjoin -1 1-u -2 1-u $data_path/left-num-20 $data_path/right-num-20
The first hjoin command above is equivalent to each of the following:
./hjoin -1 1 -2 1 -m inner -R $'\n' -F ',' $data_path/left-char-20 $data_path/right-char-20
./hjoin -1 1 -2 1 --mode inner --in-rec-sep $'\n' --in-field_sep ',' --out-rec-sep $'\n' --out-field-sep ',' $data_path/left-char-20 $data_path/right-char-20
./hjoin -1 1 -2 1 --mode inner --in-rec-sep-left $'\n' --in-rec-sep-right $'\n' --in-field_sep-left ',' --in-field_sep-right ',' --out-rec-sep $'\n' --out-field-sep ',' $data_path/left-char-20 $data_path/right-char-20
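The Hash Join strategy used by hjoin can be sketched in a few lines of plain Rust. This is a minimal illustration using only the standard library, not joinkit's actual implementation or API: key and hash_inner_join are hypothetical names, and the rows are assumed to be comma-separated with the key in the first field.

```rust
use std::collections::HashMap;

// Hypothetical helper: the join key is the first comma-separated field.
fn key(row: &str) -> &str {
    row.split(',').next().unwrap_or("")
}

// Sketch of an inner hash join; returns (left row, right row) pairs.
fn hash_inner_join<'a>(left: &[&'a str], right: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    // Build phase: load the right input into a HashMap indexed by key.
    let mut index: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for &row in right {
        index.entry(key(row)).or_default().push(row);
    }

    // Probe phase: stream the left input and look up each key.
    let mut out = Vec::new();
    for &row in left {
        if let Some(matches) = index.get(key(row)) {
            for &m in matches {
                out.push((row, m));
            }
        }
    }
    out
}

fn main() {
    let left = ["0,L", "1,L", "2,L"];
    let right = ["1,R", "2,R", "3,R"];
    for (l, r) in hash_inner_join(&left, &right) {
        println!("{} | {}", l, r);
    }
}
```

In this sketch only the right input is held in memory, and neither input needs to be sorted. The matched rows come out in the order of the left input.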
Since both input files are sorted on the join key, we can get the same results using mjoin:
./mjoin -1 1 -2 1 $data_path/left-char-20 $data_path/right-char-20
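For comparison, the Merge Join strategy can be sketched as a single pass over two sorted inputs. Again, this is an illustration under stated assumptions rather than joinkit's code: merge_inner_join and key are hypothetical names, both inputs are assumed to be sorted as text on the key, and each key is assumed to appear at most once per input.

```rust
use std::cmp::Ordering;

// Hypothetical helper: the join key is the first comma-separated field.
fn key(row: &str) -> &str {
    row.split(',').next().unwrap_or("")
}

// Sketch of an inner merge join over inputs sorted on the key
// (assumes unique keys within each input).
fn merge_inner_join<'a>(left: &[&'a str], right: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        match key(left[i]).cmp(key(right[j])) {
            // The smaller key cannot match anything later on the other side.
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push((left[i], right[j]));
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    // Both inputs are in text order, so "10" and "11" come before "8" and "9".
    let left = ["10,L", "11,L", "8,L", "9,L"];
    let right = ["11,R", "12,R", "9,R"];
    for (l, r) in merge_inner_join(&left, &right) {
        println!("{} | {}", l, r);
    }
}
```

Unlike the hash-based approach, this sketch keeps no index in memory, but it gives wrong results if either input is not sorted with the same ordering used for the comparison.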
Left exclusive join: the output contains only the rows whose key is present exclusively in the left input file.
./hjoin -1 1 -2 1 -m left-excl $data_path/left-char-20 $data_path/right-char-20
./mjoin -1 1 -2 1 -m left-excl $data_path/left-char-20 $data_path/right-char-20
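The left exclusive join amounts to filtering the left input by the set of keys seen on the right. A minimal sketch, assuming comma-separated rows keyed on the first field; left_excl_join and key are hypothetical names, not part of joinkit:

```rust
use std::collections::HashSet;

// Hypothetical helper: the join key is the first comma-separated field.
fn key(row: &str) -> &str {
    row.split(',').next().unwrap_or("")
}

// Sketch of a left exclusive join: keep the left rows with no match on the right.
fn left_excl_join<'a>(left: &[&'a str], right: &[&str]) -> Vec<&'a str> {
    // Only the keys of the right input are needed, not the full rows.
    let right_keys: HashSet<&str> = right.iter().map(|&row| key(row)).collect();
    left.iter()
        .copied()
        .filter(|&row| !right_keys.contains(key(row)))
        .collect()
}

fn main() {
    let left = ["0,L", "1,L", "2,L"];
    let right = ["1,R", "2,R", "3,R"];
    for row in left_excl_join(&left, &right) {
        println!("{}", row);
    }
}
```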
Left outer join: the output contains the union of the inner join and the left exclusive join.
./hjoin -1 1 -2 1 -m left-outer $data_path/left-char-20 $data_path/right-char-20
./mjoin -1 1 -2 1 -m left-outer $data_path/left-char-20 $data_path/right-char-20
Right exclusive join: the output contains only the rows whose key is present exclusively in the right input file.
Note that with hjoin the output follows HashMap's internal ordering, which is very likely different from that of the input.
./hjoin -1 1 -2 1 -m right-excl $data_path/left-char-20 $data_path/right-char-20
./mjoin -1 1 -2 1 -m right-excl $data_path/left-char-20 $data_path/right-char-20
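The ordering caveat is easy to reproduce with a plain HashMap. The sketch below is one possible way to compute a right exclusive join with a hash table, shown only to illustrate the effect; it is an assumption about the general approach, not a description of how hjoin is implemented. right_excl_join and key are hypothetical names, and keys are assumed to be unique within the right input.

```rust
use std::collections::HashMap;

// Hypothetical helper: the join key is the first comma-separated field.
fn key(row: &str) -> &str {
    row.split(',').next().unwrap_or("")
}

// Sketch of a right exclusive join (assumes unique keys in the right input).
fn right_excl_join<'a>(left: &[&str], right: &[&'a str]) -> Vec<&'a str> {
    // Load the right input into a HashMap indexed by key.
    let mut index: HashMap<&str, &str> = right.iter().map(|&row| (key(row), row)).collect();

    // Drop every right row whose key also appears on the left.
    for &row in left {
        index.remove(key(row));
    }

    // Whatever remains is exclusive to the right input; the iteration
    // order of a HashMap is unspecified, so the row order may vary.
    index.into_values().collect()
}

fn main() {
    let left = ["0,L", "1,L", "2,L"];
    let right = ["1,R", "2,R", "3,R", "4,R"];
    for row in right_excl_join(&left, &right) {
        println!("{}", row);
    }
}
```

If a stable order is needed, the result can be sorted afterwards, or the merge-based approach can be used on sorted input.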
Right outer join: the output contains the union of the inner join and the right exclusive join.
Note that with hjoin the output follows HashMap's internal ordering, which is very likely different from that of the input.
./hjoin -1 1 -2 1 -m right-outer $data_path/left-char-20 $data_path/right-char-20
./mjoin -1 1 -2 1 -m right-outer $data_path/left-char-20 $data_path/right-char-20
Full outer join: the output contains the union of the left exclusive join, the inner join and the right exclusive join.
Note that with hjoin the output follows HashMap's internal ordering, which is very likely different from that of the input.
./hjoin -1 1 -2 1 -m full-outer $data_path/left-char-20 $data_path/right-char-20
./mjoin -1 1 -2 1 -m full-outer $data_path/left-char-20 $data_path/right-char-20
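The way a full outer join combines the three other results is visible in a merge-based sketch, where each comparison outcome corresponds to one of them. This is an illustration only: merge_full_outer_join and key are hypothetical names, the inputs are assumed to be sorted as text on the key with unique keys, and the Option pairs are a representation chosen for this example rather than joinkit's output format.

```rust
use std::cmp::Ordering;

// Hypothetical helper: the join key is the first comma-separated field.
fn key(row: &str) -> &str {
    row.split(',').next().unwrap_or("")
}

// Sketch of a full outer merge join over sorted inputs with unique keys.
fn merge_full_outer_join<'a>(
    left: &[&'a str],
    right: &[&'a str],
) -> Vec<(Option<&'a str>, Option<&'a str>)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() || j < right.len() {
        // An exhausted side behaves as if its next key were larger than any other.
        let ord = match (left.get(i), right.get(j)) {
            (Some(l), Some(r)) => key(l).cmp(key(r)),
            (Some(_), None) => Ordering::Less,
            _ => Ordering::Greater,
        };
        match ord {
            // Key only on the left: the left exclusive part.
            Ordering::Less => {
                out.push((Some(left[i]), None));
                i += 1;
            }
            // Key only on the right: the right exclusive part.
            Ordering::Greater => {
                out.push((None, Some(right[j])));
                j += 1;
            }
            // Key on both sides: the inner part.
            Ordering::Equal => {
                out.push((Some(left[i]), Some(right[j])));
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    let left = ["0,L", "1,L"];
    let right = ["1,R", "2,R"];
    for pair in merge_full_outer_join(&left, &right) {
        println!("{:?}", pair);
    }
}
```

Dropping one or two of the three arms yields the other join modes: keeping only the Equal arm gives the inner join, while keeping Less and Equal gives the left outer join.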
Joinkit is licensed under the MIT license.