# Corpus Preprocessor

[![Build binary](https://github.com/dosjorge/corpus-preproc/actions/workflows/release.yml/badge.svg)](https://github.com/dosjorge/corpus-preproc/actions/workflows/release.yml)

CLI and HTTP API to preprocess corpora for word embeddings and possibly other NLP tasks. The main goal is to convert many HTML or plain text files into a single normalized plain text corpus.

## Features

- Parallel processing of files in a directory (CLI only)
- NFKC and whitespace normalization
- Removal of modifiers and marks
- Lower-case folding
- Trimming of punctuation around words
- Replace words with `` placeholder if they meet any of the following criteria:
  - Word has an at sign `@`
  - Word lacks alphabetic characters
  - Word has two punctuation chars in a row, such as `http://`
- HTML code is parsed and CSS selectors can be used to:
  - Remove undesired elements
  - Insert newlines after paragraphs and line breaks
  - Extract the main content of an HTML document
- Text is automatically converted to UTF-8 if the original encoding is in the [Encoding Standard](https://encoding.spec.whatwg.org/#names-and-labels).
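The character and word normalization steps listed above can be approximated in a few lines of Python. This is only an illustrative sketch, not the tool's actual implementation: the `PLACEHOLDER` value, the function names, the exact Unicode categories treated as "modifiers and marks", the choice to drop tokens that consist only of punctuation, and the order of the steps (trimming before the placeholder checks) are all assumptions.

```python
import unicodedata

# Hypothetical placeholder token; the token corpus-preproc really emits is not shown here.
PLACEHOLDER = "<unk>"


def is_punct(ch):
    """True if the character belongs to a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def normalize_chars(text, keep_marks=False, lowercase=True):
    """NFKC normalization, optional mark/modifier removal, optional case folding."""
    text = unicodedata.normalize("NFKC", text)
    if not keep_marks:
        # Decompose so accents become separate code points, then drop
        # marks (M*) and modifiers (Lm, Sk). Category choice is an assumption.
        decomposed = unicodedata.normalize("NFD", text)
        kept = (
            ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")
            and unicodedata.category(ch) not in ("Lm", "Sk")
        )
        text = unicodedata.normalize("NFKC", "".join(kept))
    return text.casefold() if lowercase else text


def trim_punct(word):
    """Strip punctuation from both ends of a word, leaving inner punctuation."""
    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    return word[start:end]


def normalize_word(word):
    """Trim a word, then replace it with the placeholder if any criterion matches."""
    word = trim_punct(word)
    if not word:
        return ""  # assumption: tokens made only of punctuation are dropped
    has_at = "@" in word
    no_alpha = not any(ch.isalpha() for ch in word)
    double_punct = any(is_punct(a) and is_punct(b) for a, b in zip(word, word[1:]))
    return PLACEHOLDER if (has_at or no_alpha or double_punct) else word


def preprocess_line(line):
    """Apply character normalization, then word normalization; collapse whitespace."""
    words = (normalize_word(w) for w in normalize_chars(line).split())
    return " ".join(w for w in words if w)


print(preprocess_line("HELLo, WORLD!!!"))
print(preprocess_line("Café  at http://example.com or me@example.org in 2024"))
```

Under these assumptions, the first call prints `hello world` and the second prints `cafe at <unk> or <unk> in <unk>`: the URL is replaced because `://` contains consecutive punctuation, the e-mail address because of the `@`, and `2024` because it has no alphabetic characters.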
## Usage

### Command Line Interface (CLI)

```console
# Install
$ cargo install corpus-preproc

# Run CLI help
$ corpus-preproc clean -h
Preprocess a file or directory

USAGE:
    corpus-preproc clean [OPTIONS]

ARGS:

OPTIONS:
    -c
            Clean HTML tags
        --content-selector
            CSS selector for main content
        --delete-selector
            CSS selector for tag removal [default: "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure"]
    -h, --help
            Print help information
    -l
            Perform case-folding
    -m
            Keep modifiers and marks on normalization
    -n
            Perform NFKC and whitespace normalization
        --nl-append-selector
            CSS selector to append newline [default: "div, p, hr, br, h1, h2, h3, h4, h5, h6"]
    -p
            Trim punctuation surrounding words
    -t
            Number of threads to use [default: 4]
```

### HTTP API

#### Startup

```console
$ corpus-preproc serve 127.0.0.1:8000
```

#### Python Example

The [`requests`](https://docs.python-requests.org/en/latest/user/install/) Python library needs to be installed.

```python
import json

import requests

# Must match the address passed to `corpus-preproc serve`
API_URL = 'http://127.0.0.1:8000/preproc'

DEFAULT_CONFIG = {
    "htmlClean": {
        "enabled": True,
        "contentSelector": None,
        "deleteSelector": "script, style, pre, svg, math, noscript, ref, table, tr, td, ol, ul, li, time, [aria-hidden], img, figure",
        "nlAppendSelector": "div, p, hr, br, h1, h2, h3, h4, h5, h6",
    },
    "charNormalization": {
        "enabled": True,
        "keepModifiersAndMarks": False,
        "lowercase": True,
    },
    "wordNormalization": {
        "enabled": True,
        "replacePii": True,
    }
}


def clean_text(text):
    # Both parts are sent as multipart/form-data fields
    files = {
        'config': (None, json.dumps(DEFAULT_CONFIG), 'application/json'),  # optional
        'data': (None, text, 'text/plain'),
    }
    response = requests.post(API_URL, files=files)
    return response.text


clean = clean_text("HELLo, WORLD!!!").rstrip()
assert clean == "hello world", f"unexpected output: {clean!r}"
```

## TODO

- [ ] Normalize or remove inner word separators
- [ ] Replace `indicatif` with `linya`
- [ ] Export and load CLI options as JSON files

## Wishlist

### Speed

- [ ] Use the efficient plain text preprocessors of `tokenizers`
- [ ] Use a better text data structure such as `ropey` or `tendril`
- [ ] Determine the feasibility of processing text as a stream instead of loading the entire file into memory
  - See `lol-html` and `html5ever` issue [#149](https://github.com/servo/html5ever/issues/149)

### Functionality

- [ ] Implement quality control (minimum and maximum sentence length)
- [ ] Implement PDF text extractor with `pdf-extract`
- [ ] Implement DOCX/PPTX/ODT text extractor with `dotext` or `docx`
- [ ] Implement stemmer with `rust-stemmers`
- [ ] Implement sentence filtering based on desired language with `fasttext-rs` and a [language identification model](https://fasttext.cc/blog/2017/10/02/blog-post.html)
- [ ] Automatically concatenate common MWEs with `MITIE` (Rust bindings missing) or `phrase`

### Interoperability

- [ ] Python bindings