Crates.io | ssam |
lib.rs | ssam |
version | 0.2.0 |
source | src |
created_at | 2020-09-04 21:58:59.081522 |
updated_at | 2021-03-03 13:12:20.591387 |
description | Ssam, short for split sampler, splits one or more text-based input files into multiple sets using random sampling. This is useful for splitting data into a training, test and development sets, or whatever sets you desire. |
homepage | https://github.com/proycon/ssam/ |
repository | https://github.com/proycon/ssam |
max_upload_size | |
id | 284879 |
size | 61,392 |
Ssam, short for split sampler, splits one or more text-based input files into multiple sets using random sampling. This is useful for splitting data into training, test and development sets, or whatever sets you desire.
Use --shuffle for more randomness in the output order.

Install it using Rust's package manager:
cargo install ssam
No cargo/rust on your system yet? Do sudo apt install cargo on Debian/Ubuntu based systems, brew install rust on macOS, or use rustup.
See ssam --help for extensive usage information.
Suppose you have a text file sentences.txt with one sentence per line, and you want to sample the sentences into a test, development and train set using respectively 10% (0.1), 10% (0.1) and the remainder (*) of the sentences:
$ ssam --sizes "0.1,0.1,*" --names "test,dev,train" sentences.txt
This will output three files: sentences.train.txt, sentences.test.txt and sentences.dev.txt. If you don't specify any names explicitly, the infix will simply be set1, set2, set3, etc.
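To make the size specification and output naming above concrete, here is a minimal Python sketch. It is not ssam's actual implementation: the helper names resolve_sizes and output_name are hypothetical, and rounding fractions to the nearest whole unit is an assumption, so ssam's exact counts may differ.

```python
from pathlib import Path


def resolve_sizes(spec: str, total: int) -> list[int]:
    """Turn a spec such as "0.1,0.1,*" into absolute unit counts.

    Assumption: fractions are rounded to the nearest whole unit;
    ssam's own rounding rule may differ.
    """
    fields = [field.strip() for field in spec.split(",")]
    if fields.count("*") > 1:
        raise ValueError("only one '*' (remainder) set is allowed")
    # None marks the remainder set until the fixed sizes are known.
    counts = [None if f == "*" else round(float(f) * total) for f in fields]
    used = sum(c for c in counts if c is not None)
    if used > total:
        raise ValueError("requested sizes exceed the number of units")
    return [total - used if c is None else c for c in counts]


def output_name(path: str, infix: str) -> str:
    """Insert the set name before the extension: sentences.txt -> sentences.train.txt."""
    p = Path(path)
    return f"{p.stem}.{infix}{p.suffix}"


counts = resolve_sizes("0.1,0.1,*", 1000)
names = [output_name("sentences.txt", n) for n in ("test", "dev", "train")]
```

Under these assumptions, a file of 1000 sentences would yield 100 test, 100 dev and 800 train sentences.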
Suppose you have the same sentences in German in a file called sätze.txt and the sentences line up nicely with the ones in sentences.txt (i.e. the same line numbers correspond and contain translations). You can now make a dependent split as follows:
$ ssam --shuffle --sizes "0.1,0.1,*" --names "test,dev,train" sentences.txt sätze.txt
The sentences will still correspond in each of the output sets. We also added --shuffle for more randomness in the output order, as by default ssam preserves order.
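The idea behind a dependent split can be sketched in a few lines of Python: draw the line numbers once and apply the same assignment to every input. This is only an illustration of the concept, not ssam's code; the function dependent_split and its parameters are hypothetical.

```python
import random


def dependent_split(columns, counts, shuffle=False, seed=None):
    """Split several aligned lists of units using one shared index assignment.

    columns: one list of units per input file, all of equal length.
    counts:  number of units per output set.
    Returns one entry per set, each holding one list per input file.
    """
    total = len(columns[0])
    if any(len(col) != total for col in columns):
        raise ValueError("aligned inputs must have the same number of units")
    if sum(counts) > total:
        raise ValueError("requested sizes exceed the number of units")
    rng = random.Random(seed)
    # Sample without replacement, so no unit ends up in two sets.
    indices = rng.sample(range(total), sum(counts))
    sets, start = [], 0
    for count in counts:
        chosen = indices[start:start + count]
        start += count
        if not shuffle:
            chosen.sort()  # keep the original input order within each set
        # The same indices are used for every file, so line i stays with line i.
        sets.append([[col[i] for i in chosen] for col in columns])
    return sets


english = [f"sentence {i}" for i in range(10)]
german = [f"Satz {i}" for i in range(10)]
test_set, dev_set, train_set = dependent_split([english, german], [1, 1, 8], seed=42)
```

Because both files are indexed with the same sampled line numbers, each English sentence lands in the same set, at the same position, as its German translation.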
Ssam can also read from stdin (provided you want to supply only one input document). If you're only doing one sample (rather than three as shown above), then it will simply output to stdout.
Rather than using lines as units, you can specify a delimiter manually. For example, set --delimiter "" (empty delimiter) if you want empty lines to be the delimiter, such as those often used to separate paragraphs. Alternatively, you can set it to an explicit marker in your input, like --delimiter "<utt>" for example.
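The following Python sketch shows one way the three kinds of units described above (lines, paragraphs separated by empty lines, and units separated by an explicit marker) could be carved out of a text. It is an assumption-laden illustration rather than ssam's implementation: the function split_units is hypothetical, and details such as dropping empty units are guesses.

```python
import re


def split_units(text: str, delimiter: str = "\n") -> list[str]:
    """Split text into sampling units.

    delimiter "\n": every line is a unit (the default case).
    delimiter "":   empty lines separate units, e.g. paragraphs.
    anything else:  the literal marker separates units, e.g. "<utt>".
    Assumption: empty units are dropped.
    """
    if delimiter == "":
        # One or more blank lines end a unit.
        parts = re.split(r"\n\s*\n", text)
    else:
        parts = text.split(delimiter)
    return [part.strip("\n") for part in parts if part.strip()]


lines = split_units("first line\nsecond line\n")
paragraphs = split_units("para one,\nstill para one\n\npara two\n", delimiter="")
utterances = split_units("hello<utt>how are you<utt>", delimiter="<utt>")
```

With an empty delimiter, a multi-line paragraph is kept together as a single unit, so it is sampled as a whole.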
In loving memory of our cat Sam, 2009-2019.