# Lindera Filter [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Join the chat at https://gitter.im/lindera-morphology/lindera](https://badges.gitter.im/lindera-morphology/lindera.svg)](https://gitter.im/lindera-morphology/lindera?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge) [![Crates.io](https://img.shields.io/crates/v/lindera-filter.svg)](https://crates.io/crates/lindera-filter) Character and token filters for [Lindera](https://github.com/lindera-morphology/lindera). ## Character filters ### Japanese iteration mark character filter Normalizes Japanese horizontal [iteration marks](https://en.wikipedia.org/wiki/Iteration_mark) (odoriji) to their expanded form. Sequences of iteration marks are supported. In case an illegal sequence of iteration marks is encountered, the implementation emits the illegal source character as-is without considering its script. For example, with input "?ゝ", we get "??" even though the question mark isn't hiragana. ### Mapping character filter Replace characters with the specified character mappings, and correcting the resulting changes to the offsets. Matching is greedy (longest pattern matching at a given point wins). Replacement is allowed to be the empty string. ### Regex character filter Character filter that uses a regular expression for the target of replace string. ### Unicode normalize character filter Unicode normalization to normalize the input text, that using the specified normalization form, one of NFC, NFD, NFKC, or NFKD. ## Token filters ### Japanese base form token filter Replace the term text with the base form registered in the morphological dictionary. This acts as a lemmatizer for verbs and adjectives. ### Japanese compound word token filter Compound consecutive tokens that have specified part-of-speech tags into a single token. This is useful for handling compound words that are not registered in the morphological dictionary. ### Japanese katakana stem token filter Normalizes common katakana spelling variations ending with a long sound (U+30FC) by removing that character. Only katakana words longer than the minimum length are stemmed. ### Japanese keep tags token filter Keep only tokens with the specified part-of-speech tag. ### Japanese number token filter Convert tokens representing Japanese numerals, including Kanji numerals, to Arabic numerals. ### Japanese reading form token filter Replace the text of a token with the reading of the text as registered in the morphological dictionary. The reading is in katakana. ### Japanese stop tags token filter Remove tokens with the specified part-of-speech tag. ### Keep words token filter Keep only the tokens of the specified text. ### Korean keep tags token filter Keep only tokens with the specified part-of-speech tag. ### Korean reading form token filter Replace the text of a token with the reading of the text as registered in the morphological dictionary. ### Korean stop tags token filter Remove tokens with the specified part-of-speech tag. ### Length token filter Keep only tokens with the specified number of characters of text. ### Lowercase token filter Normalizes token text to lowercase. ### Mapping token filter Replace characters with the specified character mappings. ### Stop words token filter Remove the tokens of the specified text. ### Uppercase token filter Normalizes token text to uppercase. ## API reference The API reference is available. Please see following URL: - [lindera-filter](https://docs.rs/lindera-filter)