Benchmarks
==========

Tokenization
------------

The following benchmarks illustrate tokenization accuracy (F1 score) on the
`UD treebanks`_,

=======  =========  =========  ===========  =======
lang     dataset    regexp     spacy 2.1    vtext
=======  =========  =========  ===========  =======
en       EWT        0.812      0.972        0.966
en       GUM        0.881      0.989        0.996
de       GSD        0.896      0.944        0.964
fr       Sequoia    0.844      0.968        0.971
=======  =========  =========  ===========  =======

and the English tokenization speed in million words per second (MWPS),

==================  ==========  ===========  ==========
.                   regexp      spacy 2.1    vtext
==================  ==========  ===========  ==========
**Speed (MWPS)**    3.1         0.14         2.1
==================  ==========  ===========  ==========

Text vectorization
------------------

Below are benchmarks for converting textual data to a sparse document-term
matrix using the 20 newsgroups dataset, run on an Intel(R) Xeon(R) CPU
E3-1270 v6 @ 3.80GHz:

=============================  ====================  =================  =================
Speed (MB/s)                   scikit-learn 0.20.1   vtext (n_jobs=1)   vtext (n_jobs=4)
=============================  ====================  =================  =================
CountVectorizer.fit            14                    104                225
CountVectorizer.transform      14                    82                 303
CountVectorizer.fit_transform  14                    70                 NA
HashingVectorizer.transform    19                    89                 309
=============================  ====================  =================  =================

Note, however, that these two estimators in vtext currently support only a
fraction of scikit-learn's functionality.
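
For context, the call being timed in the tokenization tables looks roughly
like the sketch below. It assumes vtext's Python bindings expose a
``UnicodeSegmentTokenizer`` in ``vtext.tokenize`` with a ``tokenize`` method
returning a list of strings; treat the class name and constructor argument
as assumptions rather than a documented contract.

.. code-block:: python

    # Hedged sketch: the class name and ``word_bounds`` argument are
    # assumed, not confirmed by the benchmark tables above.
    from vtext.tokenize import UnicodeSegmentTokenizer

    tokenizer = UnicodeSegmentTokenizer(word_bounds=True)
    # Returns a list of token strings for the input sentence.
    print(tokenizer.tokenize("Flights can't depart after 2:00 pm."))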
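
A speed figure in MWPS can be approximated by tokenizing a corpus and
dividing the word count by the elapsed wall-clock time. The helper below is
a minimal sketch, not the harness used for the numbers above; ``tokenize``
and ``docs`` are placeholders supplied by the caller.

.. code-block:: python

    import time

    def million_words_per_second(tokenize, docs):
        """Approximate MWPS: whitespace-delimited words / elapsed seconds."""
        n_words = sum(len(doc.split()) for doc in docs)
        start = time.perf_counter()
        for doc in docs:
            tokenize(doc)
        elapsed = time.perf_counter() - start
        return n_words / elapsed / 1e6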
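
The vectorization throughput can be estimated the same way, timing
``fit_transform`` over the 20 newsgroups corpus and dividing the input size
in megabytes by the elapsed time. The scikit-learn calls below use the
documented ``fetch_20newsgroups`` and ``CountVectorizer`` APIs; the
commented-out vtext import and its scikit-learn-like interface are
assumptions suggested by the table above.

.. code-block:: python

    import time

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    docs = fetch_20newsgroups(subset="train").data

    def throughput_mb_s(vectorizer, docs):
        """Time fit_transform; report MB of input text processed per second."""
        n_mb = sum(len(doc.encode("utf-8")) for doc in docs) / 1e6
        start = time.perf_counter()
        vectorizer.fit_transform(docs)
        return n_mb / (time.perf_counter() - start)

    print("scikit-learn:", throughput_mb_s(CountVectorizer(), docs))

    # vtext side: module path and class name assumed, not verified here.
    # from vtext.vectorize import CountVectorizer as VtextCountVectorizer
    # print("vtext:", throughput_mb_s(VtextCountVectorizer(), docs))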