| Crates.io | corpus-preproc |
| lib.rs | corpus-preproc |
| version | 0.1.0 |
| created_at | 2022-02-06 03:10:03.280819+00 |
| updated_at | 2022-02-06 03:10:03.280819+00 |
| description | A preprocessor for text and HTML corpora |
| homepage | |
| repository | https://github.com/dosjorge/corpus-preproc |
| max_upload_size | |
| id | 527650 |
| size | 467,702 |
CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.
Words are replaced with an <unk> placeholder if they meet any of the following criteria:
- @
- http://
# Install
$ cargo install corpus-preproc
# Run CLI help
$ corpus-preproc clean -h
Preprocess a file or directory
USAGE:
corpus-preproc clean [OPTIONS] <INPUT> <OUTPUT>
ARGS:
<INPUT>
<OUTPUT>
OPTIONS:
-c
Clean HTML tags
--content-selector <CONTENT_SELECTOR>
CSS selector for main content
--delete-selector <DELETE_SELECTOR>
CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref,
table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]
-h, --help
Print help information
-l
Perform case-folding
-m
Keep modifiers and marks on normalization
-n
Perform NFKC and whitespace normalization
--nl-append-selector <NL_APPEND_SELECTOR>
CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]
-p
Trim punctuation surrounding words
-t <THREADS>
Number of threads to use [default: 4]
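When the CLI is driven from a script, the flags listed in the help output can be assembled programmatically. The following is a minimal sketch and not part of corpus-preproc: `build_clean_command` and `run_clean` are hypothetical helper names, and it assumes the `corpus-preproc` binary is on `PATH`. Only flags that appear in the help output are used.

```python
import subprocess


def build_clean_command(input_path, output_path, *, clean_html=False,
                        normalize=False, lowercase=False,
                        trim_punctuation=False, keep_marks=False,
                        threads=None, content_selector=None,
                        delete_selector=None):
    """Build the argument list for `corpus-preproc clean` (hypothetical helper)."""
    cmd = ["corpus-preproc", "clean"]
    # Boolean switches, exactly as named in the help output.
    if clean_html:
        cmd.append("-c")
    if normalize:
        cmd.append("-n")
    if lowercase:
        cmd.append("-l")
    if trim_punctuation:
        cmd.append("-p")
    if keep_marks:
        cmd.append("-m")
    # Options that take a value; omitted ones fall back to the CLI defaults.
    if threads is not None:
        cmd += ["-t", str(threads)]
    if content_selector is not None:
        cmd += ["--content-selector", content_selector]
    if delete_selector is not None:
        cmd += ["--delete-selector", delete_selector]
    # Positional arguments come last: <INPUT> <OUTPUT>.
    cmd += [str(input_path), str(output_path)]
    return cmd


def run_clean(input_path, output_path, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    cmd = build_clean_command(input_path, output_path, **options)
    subprocess.run(cmd, check=True)
```

For example, `run_clean("./html", "./corpus.txt", clean_html=True, normalize=True, lowercase=True)` would invoke `corpus-preproc clean -c -n -l ./html ./corpus.txt`; both paths are placeholders.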
$ corpus-preproc serve 127.0.0.1:8000
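A script that starts the server and sends requests straight away may hit a connection error before the listener is up. As a small sketch using only the Python standard library, the helper below polls the address until it accepts TCP connections. `wait_for_port` is a hypothetical name, and the check only confirms that something is listening on the port, not that it is corpus-preproc.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a listener is bound to the address.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the server started as shown, `wait_for_port("127.0.0.1", 8000)` can be called once before the first request.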
The requests Python library needs to be installed.
import json

import requests

DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {
        "enabled": True,
        "replacePii": True,
    },
}


def clean_text(text):
    # Send the text and the (optional) JSON config as multipart form fields.
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'),  # optional
        'data': (None, text, 'text/plain'),
    }
    # The port must match the address passed to `corpus-preproc serve`.
    response = requests.post('http://127.0.0.1:8000/preproc', files=files)
    response.raise_for_status()
    return response.text


clean = clean_text("<b>HELLo, WORLD!!!").rstrip()
assert clean == "hello world", f"unexpected output: {clean!r}"
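The expected result in the assertion above follows from the enabled steps: the `<b>` tag is dropped, characters are normalized and lowercased, and punctuation around each word is trimmed. The sketch below is a rough offline approximation of that pipeline using only the Python standard library. It is not the crate's implementation: `approximate_clean` is a hypothetical name, every text node is kept (no CSS selector support or newline insertion), PII replacement is skipped, and "punctuation" is assumed to mean Unicode categories P and S.

```python
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def _is_punct(ch):
    # Assumption: treat Unicode punctuation (P*) and symbols (S*) as trimmable.
    return unicodedata.category(ch)[0] in ("P", "S")


def _strip_punct(word):
    start, end = 0, len(word)
    while start < end and _is_punct(word[start]):
        start += 1
    while end > start and _is_punct(word[end - 1]):
        end -= 1
    return word[start:end]


def approximate_clean(text, lowercase=True, trim_punctuation=True):
    # 1. Strip HTML tags, keeping only the text nodes.
    extractor = _TextExtractor()
    extractor.feed(text)
    extractor.close()
    text = "".join(extractor.parts)

    # 2. NFKC normalization, then optional case-folding.
    text = unicodedata.normalize("NFKC", text)
    if lowercase:
        text = text.casefold()

    # 3. Collapse whitespace and trim punctuation around each word.
    words = []
    for word in text.split():
        if trim_punctuation:
            word = _strip_punct(word)
        if word:
            words.append(word)
    return " ".join(words)
```

Calling `approximate_clean("<b>HELLo, WORLD!!!")` returns `"hello world"`, matching the server response expected above for this input. Outputs for other inputs may differ from the real service.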
- indicatif with linya
- tokenizers
- ropey or tendril
- lol-html and html5ever issue #149
- pdf-extract
- dotext or docx
- rust-stemmers
- fasttext-rs and a language identification model
- MITIE (Rust bindings missing) or phrase