corpus-count

Crates.iocorpus-count
lib.rscorpus-count
version0.1.1
sourcesrc
created_at2019-11-29 13:14:19.171052
updated_at2019-11-29 14:01:36.596741
descriptionUtil to count words and character ngrams in a corpus.
homepagehttps://github.com/sebpuetz/corpus-count
repositoryhttps://github.com/sebpuetz/corpus-count
max_upload_size
id185296
size18,432
Sebastian Pütz (sebpuetz)

documentation

README

corpus-count

Small util to count tokens and optionally character ngrams in a whitespace tokenized corpus.

Outputs frequency-sorted lists of items.

Usage

# read from file, write ngram and token counts to files
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
    -t /path/to/token_output.txt

# read from file, don't count ngrams and write token counts to stdout
$ corpus-count -c /path/to/corpus.txt

# read from stdin, don't count ngrams and write token counts to stdout
$ corpus-count < /path/to/corpus.txt

# read from file, write ngram and token counts to files, filter tokens and
# ngrams appearing less than 30 times. ngrams are counted **before** filtering
# tokens.
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
    -t /path/to/token_output.txt --token_min 30 --ngram_min 30
    

# read from file, write ngram and token counts to files, filter out tokens and
# ngrams appearing less than 30 times. Count ngrams **after** filtering tokens.
$ corpus-count -c /path/to/corpus.txt -n /path/to/ngram_output.txt \
    -t /path/to/token_output.txt --token_min 30 --ngram_min 30 --filter_first

Counting ngrams is determined by giving an argument to the --ngram_count or -n flag. Without the --filter_first flag, the ngram counts are determined before filtering tokens, therefore tokens which appear less than --token_min times can still contribute to the count of an ngram. If this flag is set, tokens are filtered first and only in-vocabulary tokens influence the counts of ngrams.

Per default, tokens are bracketed with "<" and ">" before extracting ngrams. This does not affect the tokens, only ngrams and can be toggled through the --no_bracket flag.

Minimum and maximum ngram length can be set through the respective --min_n and --max_n flags.

Install

Rust is required, most easily installed through https://rustup.rs.

cargo install corpus-count 
Commit count: 6

cargo fmt