| Field | Value |
|---|---|
| Crates.io | corpus-preproc |
| lib.rs | corpus-preproc |
| version | 0.1.0 |
| source | src |
| created_at | 2022-02-06 03:10:03.280819 |
| updated_at | 2022-02-06 03:10:03.280819 |
| description | A preprocessor for text and HTML corpora |
| homepage | |
| repository | https://github.com/dosjorge/corpus-preproc |
| max_upload_size | |
| id | 527650 |
| size | 467,702 |
CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.
Words are replaced with the `<unk>` placeholder if they meet any of the following criteria:

- they contain an `@` character (e.g. email addresses)
- they start with `http://` (e.g. URLs)
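The crate's own implementation is not shown here, but a minimal Python sketch of the substitution described above could look like the following; the regular expression and the equal treatment of `https://` are assumptions for illustration only:

```python
import re

# Sketch of the <unk> substitution described above -- NOT the crate's code.
# A token is replaced when it contains "@" or starts with "http://"
# (treating "https://" the same way is an extra assumption).
PII_LIKE = re.compile(r"^(?:\S*@\S*|https?://\S*)$")

def replace_pii_tokens(text: str) -> str:
    return " ".join("<unk>" if PII_LIKE.match(tok) else tok for tok in text.split())

print(replace_pii_tokens("write to jane@example.com or visit http://example.com"))
# -> write to <unk> or visit <unk>
```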
# Install

```sh
$ cargo install corpus-preproc
```
# Run CLI help

```sh
$ corpus-preproc clean -h
```

```text
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS] <INPUT> <OUTPUT>

ARGS:
    <INPUT>
    <OUTPUT>

OPTIONS:
    -c
            Clean HTML tags

        --content-selector <CONTENT_SELECTOR>
            CSS selector for main content

        --delete-selector <DELETE_SELECTOR>
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
            table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]

    -h, --help
            Print help information

    -l
            Perform case-folding

    -m
            Keep modifiers and marks on normalization

    -n
            Perform NFKC and whitespace normalization

        --nl-append-selector <NL_APPEND_SELECTOR>
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]

    -p
            Trim punctuation surrounding words

    -t <THREADS>
            Number of threads to use [default: 4]
```
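As a quick usage sketch (not taken from the upstream docs), the options above can be combined to turn a directory of HTML files into a single normalized corpus. The snippet below simply drives the installed CLI from Python; the `wiki_html_dump/` and `corpus.txt` paths and the thread count are hypothetical:

```python
import subprocess

# corpus-preproc must already be installed (see "# Install" above).
# Flags as documented: -c clean HTML, -n NFKC/whitespace normalization,
# -l case-folding, -p trim surrounding punctuation, -t number of threads.
subprocess.run(
    [
        "corpus-preproc", "clean",
        "-c", "-n", "-l", "-p",
        "-t", "8",
        "wiki_html_dump/",  # <INPUT>: directory of HTML files (hypothetical path)
        "corpus.txt",       # <OUTPUT>: single plain-text corpus (hypothetical path)
    ],
    check=True,
)
```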
# Run HTTP server

```sh
$ corpus-preproc serve 127.0.0.1:8000
```

The `requests` Python library needs to be installed to run the example below.
```python
import requests
import json

DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {
        "enabled": True,
        "replacePii": True,
    },
}

def clean_text(text):
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'),  # optional
        'data': (None, text, 'text/plain'),
    }
    # The address and port must match the ones passed to `corpus-preproc serve` above.
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    return response.text

clean = clean_text("<b>HELLo, WORLD!!!").rstrip()
assert (clean == "hello world"), "OK"
```
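Continuing from the example above, the configuration can be adjusted before posting. The following sketch is an assumption built only from the keys shown in `DEFAULT_CONFIG` (no separately documented options); it disables HTML cleaning and case-folding for input that is already plain text:

```python
import copy

# Reuses requests, json and DEFAULT_CONFIG from the example above;
# only keys that already appear in DEFAULT_CONFIG are modified.
plain_text_config = copy.deepcopy(DEFAULT_CONFIG)
plain_text_config["htmlClean"]["enabled"] = False            # input is already plain text
plain_text_config["charNormalization"]["lowercase"] = False  # keep original casing

def clean_plain_text(text):
    files = {
        'config': (None, json.dumps(plain_text_config), 'application/json'),
        'data': (None, text, 'text/plain'),
    }
    return requests.post('http://127.0.0.1:8000/preproc', files=files).text

print(clean_plain_text("Email jane@example.com for a copy."))
```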
# TODO

- Replace indicatif with linya (progress reporting)
- tokenizers (tokenization)
- ropey or tendril (efficient string handling)
- lol-html and html5ever (HTML processing, see issue #149)
- pdf-extract (PDF text extraction)
- dotext or docx (Office documents)
- rust-stemmers (stemming)
- fasttext-rs and a language identification model
- MITIE (Rust bindings missing) or phrase