Crates.io | corpus-count |
lib.rs | corpus-count |
version | 0.1.1 |
source | src |
created_at | 2019-11-29 13:14:19.171052 |
updated_at | 2019-11-29 14:01:36.596741 |
description | Util to count words and character ngrams in a corpus. |
homepage | https://github.com/sebpuetz/corpus-count |
repository | https://github.com/sebpuetz/corpus-count |
max_upload_size | |
id | 185296 |
size | 18,432 |
Small util to count tokens and optionally character ngrams in a whitespace tokenized corpus.
Outputs frequency-sorted lists of items.
# read from file, write ngram and token counts to files
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
-t /path/to/token_output.txt
# read from file, don't count ngrams and write token counts to stdout
$ corpus-count -c /path/to/corpus.txt
# read from stdin, don't count ngrams and write token counts to stdout
$ corpus-count < /path/to/corpus.txt
# read from file, write ngram and token counts to files, filter tokens and
# ngrams appearing less than 30 times. ngrams are counted **before** filtering
# tokens.
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
-t /path/to/token_output.txt --token_min 30 --ngram_min 30
# read from file, write ngram and token counts to files, filter out tokens and
# ngrams appearing less than 30 times. Count ngrams **after** filtering tokens.
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
-t /path/to/token_output.txt --token_min 30 --ngram_min 30 --filter_first
Counting ngrams is determined by giving an argument to the --ngram_count
or
-n
flag. Without the --filter_first
flag, the ngram counts are determined
before filtering tokens, therefore tokens which appear less than
--token_min
times can still contribute to the count of an ngram. If this flag
is set, tokens are filtered first and only in-vocabulary tokens influence the
counts of ngrams.
Per default, tokens are bracketed with "<" and ">" before extracting ngrams.
This does not affect the tokens, only ngrams and can be toggled through the
--no_bracket
flag.
Minimum and maximum ngram length can be set through the respective --min_n
and --max_n
flags.
Rust is required, most easily installed through https://rustup.rs.
cargo install corpus-count