Crates.io | vtext |
lib.rs | vtext |
version | 0.2.0 |
source | src |
created_at | 2019-04-13 10:56:14.237978 |
updated_at | 2020-06-14 21:38:29.772129 |
description | NLP with Rust |
homepage | |
repository | https://github.com/rth/vtext |
max_upload_size | |
id | 127640 |
size | 136,131 |
NLP in Rust with Python bindings
This package aims to provide a high performance toolkit for ingesting textual data for machine learning applications. It includes vectorizers analogous to CountVectorizer and HashingVectorizer in scikit-learn, but with less broad functionality.
vtext requires Python 3.6+ and can be installed with,
pip install vtext
Below is a simple tokenization example,
>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer("en").tokenize("Flights can't depart after 2:00 pm.")
["Flights", "ca", "n't", "depart" "after", "2:00", "pm", "."]
For more details see the project documentation: vtext.io/doc/latest/index.html
Add the following to Cargo.toml,
[dependencies]
vtext = "0.2.0"
For more details see the Rust documentation: docs.rs/vtext
The following benchmarks illustrate the tokenization accuracy (F1 score) on UD treebanks,
lang | dataset | regexp | spacy 2.1 | vtext |
---|---|---|---|---|
en | EWT | 0.812 | 0.972 | 0.966 |
en | GUM | 0.881 | 0.989 | 0.996 |
de | GSD | 0.896 | 0.944 | 0.964 |
fr | Sequoia | 0.844 | 0.968 | 0.971 |
and the English tokenization speed,
| | regexp | spacy 2.1 | vtext |
|---|---|---|---|
| Speed (10⁶ tokens/s) | 3.1 | 0.14 | 2.1 |
Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset, run on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz,
Speed (MB/s) | scikit-learn 0.20.1 | vtext (n_jobs=1) | vtext (n_jobs=4) |
---|---|---|---|
CountVectorizer.fit | 14 | 104 | 225 |
CountVectorizer.transform | 14 | 82 | 303 |
CountVectorizer.fit_transform | 14 | 70 | NA |
HashingVectorizer.transform | 19 | 89 | 309 |
Note however that these two estimators in vtext currently support only a fraction of scikit-learn's functionality. See benchmarks/README.md for more details.
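For reference, a rough reproduction of such a throughput measurement could look like the sketch below; the vtext import path and the MB/s accounting are assumptions here, and the actual benchmark scripts live in benchmarks/,

import time
from sklearn.datasets import fetch_20newsgroups
from vtext.vectorize import HashingVectorizer  # import path assumed

docs = fetch_20newsgroups(subset="train").data  # list of raw text documents

start = time.time()
X = HashingVectorizer().transform(docs)  # sparse document-term matrix
elapsed = time.time() - start

size_mb = sum(len(d.encode("utf-8")) for d in docs) / 1e6
print(f"{size_mb / elapsed:.1f} MB/s")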
vtext is released under the Apache License, Version 2.0.