Crates.io | lindera-filter |
lib.rs | lindera-filter |
version | 0.32.2 |
source | src |
created_at | 2023-02-16 12:29:59.679471 |
updated_at | 2024-06-29 15:57:54.226842 |
description | Character and token filters for Lindera. |
homepage | https://github.com/lindera-morphology/lindera |
repository | https://github.com/lindera-morphology/lindera |
id | 786710 |
size | 289,567 |
Character and token filters for Lindera.
Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form. Sequences of iteration marks are supported. In case an illegal sequence of iteration marks is encountered, the implementation emits the illegal source character as-is without considering its script. For example, with input "?ゝ", we get "??" even though the question mark isn't hiragana.
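The expansion described above can be sketched in plain Rust. This is an illustrative approximation, not Lindera's implementation: the function names are hypothetical, only the unvoiced marks 々, ゝ and ヽ are handled (the voiced marks ゞ and ヾ are left out for brevity), and the rule that a run of n marks repeats the n characters before it is an assumption.

```rust
/// Returns true for the unvoiced iteration marks handled by this sketch.
fn is_iteration_mark(c: char) -> bool {
    // U+3005 (kanji), U+309D (hiragana), U+30FD (katakana)
    matches!(c, '\u{3005}' | '\u{309D}' | '\u{30FD}')
}

/// Expands runs of iteration marks by repeating the characters that
/// precede the run. The source characters are copied as-is, whatever
/// their script, which mirrors the "?ゝ" example in the text.
fn expand_iteration_marks(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out: Vec<char> = Vec::with_capacity(chars.len());
    let mut i = 0;
    while i < chars.len() {
        if !is_iteration_mark(chars[i]) {
            out.push(chars[i]);
            i += 1;
            continue;
        }
        // Measure the run of consecutive iteration marks.
        let start = i;
        while i < chars.len() && is_iteration_mark(chars[i]) {
            i += 1;
        }
        let run = i - start;
        // A run of n marks refers back to at most n already-emitted characters.
        let span = run.min(out.len());
        if span == 0 {
            // Nothing precedes the run, so the marks are kept unchanged.
            out.extend_from_slice(&chars[start..i]);
            continue;
        }
        let base = out.len() - span;
        for k in 0..run {
            let c = out[base + k % span];
            out.push(c);
        }
    }
    out.into_iter().collect()
}
```

With this sketch, `expand_iteration_marks("時々")` yields "時時" and `expand_iteration_marks("?ゝ")` yields "??".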
Replace characters using the specified character mappings and correct the resulting offset changes. Matching is greedy: the longest pattern that matches at a given position wins. The replacement may be an empty string.
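Greedy longest-match replacement with offset tracking might look like the following minimal sketch. The function name `apply_mapping` and the choice to record, for each output character, the input byte offset it came from are assumptions made for illustration; Lindera's own offset bookkeeping may be organized differently.

```rust
/// Applies `mapping` (pattern, replacement) pairs to `input`.
/// Returns the rewritten text and, for every output character, the byte
/// offset in `input` where the text that produced it started.
fn apply_mapping(input: &str, mapping: &[(&str, &str)]) -> (String, Vec<usize>) {
    let mut out = String::new();
    let mut origins = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        // Greedy choice: among all patterns that match here, take the longest.
        let best = mapping
            .iter()
            .filter(|entry| !entry.0.is_empty() && rest.starts_with(entry.0))
            .max_by_key(|entry| entry.0.len());
        if let Some(&(pattern, replacement)) = best {
            // An empty replacement emits nothing, which deletes the match.
            for c in replacement.chars() {
                out.push(c);
                origins.push(pos);
            }
            pos += pattern.len();
        } else if let Some(c) = rest.chars().next() {
            // No pattern matches: copy one character through unchanged.
            out.push(c);
            origins.push(pos);
            pos += c.len_utf8();
        }
    }
    (out, origins)
}
```

For example, with the mappings `("a", "X")` and `("ab", "Y")`, the input "abc" becomes "Yc" because the longer pattern "ab" is preferred, and the recorded origins are the byte offsets 0 and 2.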
Character filter that replaces text matching a regular expression with the specified replacement string.
Normalizes the input text using the specified Unicode normalization form: NFC, NFD, NFKC, or NFKD.
Replace the term text with the base form registered in the morphological dictionary. This acts as a lemmatizer for verbs and adjectives.
Compound consecutive tokens that have specified part-of-speech tags into a single token. This is useful for handling compound words that are not registered in the morphological dictionary.
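As a rough illustration of compounding, the sketch below merges runs of adjacent tokens whose tag is in a given list. The `Token` struct, the `tok` helper, the `compound` function and the label given to merged tokens are all hypothetical; Lindera's real token type carries more detail than a single tag string.

```rust
/// Simplified token: surface text plus a single part-of-speech tag.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    tag: String,
}

/// Convenience constructor used in the examples.
fn tok(text: &str, tag: &str) -> Token {
    Token { text: text.to_string(), tag: tag.to_string() }
}

/// Merges each run of consecutive tokens whose tag appears in `tags`
/// into one token labeled `compound_tag`. Single tokens are left as-is.
fn compound(tokens: &[Token], tags: &[&str], compound_tag: &str) -> Vec<Token> {
    let mut out: Vec<Token> = Vec::new();
    let mut prev_mergeable = false;
    for token in tokens {
        let mergeable = tags.contains(&token.tag.as_str());
        if mergeable && prev_mergeable {
            // Extend the token emitted in the previous iteration.
            if let Some(last) = out.last_mut() {
                last.text.push_str(&token.text);
                last.tag = compound_tag.to_string();
            }
        } else {
            out.push(token.clone());
        }
        prev_mergeable = mergeable;
    }
    out
}
```

Given the tokens 関西, 国際, 空港 (all tagged 名詞) followed by に (助詞), calling `compound` with the tag list `["名詞"]` produces the single token 関西国際空港 followed by に.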
Normalizes common katakana spelling variations ending with a long sound (U+30FC) by removing that character. Only katakana words longer than the minimum length are stemmed.
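The stemming rule can be sketched as follows. The names are hypothetical, and reading "longer than the minimum length" as a strict comparison on the character count is an assumption taken from the wording above; the actual boundary handling in Lindera may differ.

```rust
/// Katakana-Hiragana prolonged sound mark (U+30FC).
const LONG_SOUND: char = '\u{30FC}';

/// True if `c` lies in the Unicode Katakana block (U+30A0..=U+30FF).
fn is_katakana(c: char) -> bool {
    ('\u{30A0}'..='\u{30FF}').contains(&c)
}

/// Drops one trailing long sound mark from an all-katakana word that has
/// more than `min_len` characters. Other words are returned unchanged.
fn stem_katakana(word: &str, min_len: usize) -> &str {
    let long_enough = word.chars().count() > min_len;
    if long_enough && word.ends_with(LONG_SOUND) && word.chars().all(is_katakana) {
        &word[..word.len() - LONG_SOUND.len_utf8()]
    } else {
        word
    }
}
```

With a minimum length of 3, `stem_katakana("コンピューター", 3)` returns "コンピュータ", while the three-character word "コピー" is left untouched.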
Keep only tokens with the specified part-of-speech tag.
Convert tokens representing Japanese numerals, including Kanji numerals, to Arabic numerals.
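To make the numeral conversion concrete, here is a small, self-contained parser for Kanji numerals. It is a sketch under stated assumptions rather than Lindera's algorithm: the function names are hypothetical, only the characters 〇 to 九, 十, 百, 千, 万, 億 and 兆 are recognized, and a bare large unit such as 万 is treated as one of that unit.

```rust
fn digit_value(c: char) -> Option<u64> {
    "〇一二三四五六七八九".chars().position(|d| d == c).map(|i| i as u64)
}

fn small_unit(c: char) -> Option<u64> {
    match c {
        '十' => Some(10),
        '百' => Some(100),
        '千' => Some(1_000),
        _ => None,
    }
}

fn large_unit(c: char) -> Option<u64> {
    match c {
        '万' => Some(10_000),
        '億' => Some(100_000_000),
        '兆' => Some(1_000_000_000_000),
        _ => None,
    }
}

/// Parses a Kanji numeral such as "千二百三十四" into 1234.
/// Returns None for empty input, unknown characters, or overflow.
fn kanji_to_number(text: &str) -> Option<u64> {
    if text.is_empty() {
        return None;
    }
    let mut total: u64 = 0; // groups already closed by 万, 億 or 兆
    let mut section: u64 = 0; // value accumulated below the next large unit
    let mut digits: Option<u64> = None; // digits not yet attached to a unit
    for c in text.chars() {
        if let Some(d) = digit_value(c) {
            // Consecutive digits are read positionally, e.g. 二〇二四.
            digits = Some(digits.unwrap_or(0).checked_mul(10)?.checked_add(d)?);
        } else if let Some(unit) = small_unit(c) {
            // A unit without a leading digit counts once, e.g. 十 = 10.
            let n = digits.take().unwrap_or(1);
            section = section.checked_add(n.checked_mul(unit)?)?;
        } else if let Some(unit) = large_unit(c) {
            section = section.checked_add(digits.take().unwrap_or(0))?;
            let group = if section == 0 { 1 } else { section };
            total = total.checked_add(group.checked_mul(unit)?)?;
            section = 0;
        } else {
            return None;
        }
    }
    total.checked_add(section)?.checked_add(digits.unwrap_or(0))
}
```

For instance, `kanji_to_number("一億二千万")` evaluates to `Some(120000000)`, and a token filter built on it would replace the token text with that value formatted as a decimal string.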
Replace the text of a token with the reading of the text as registered in the morphological dictionary. The reading is in katakana.
Remove tokens with the specified part-of-speech tag.
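The keep and stop variants of tag filtering are mirror images of each other, as this minimal sketch suggests. The `Token` struct, the `tok` helper and both function names are hypothetical and much simpler than Lindera's actual types, which store part-of-speech information as dictionary details rather than one string.

```rust
use std::collections::HashSet;

/// Simplified token: surface text plus a single part-of-speech tag.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    tag: String,
}

/// Convenience constructor used in the examples.
fn tok(text: &str, tag: &str) -> Token {
    Token { text: text.to_string(), tag: tag.to_string() }
}

/// Keeps only tokens whose tag is in `tags`.
fn keep_tags(tokens: Vec<Token>, tags: &HashSet<&str>) -> Vec<Token> {
    tokens
        .into_iter()
        .filter(|t| tags.contains(t.tag.as_str()))
        .collect()
}

/// Removes tokens whose tag is in `tags`.
fn stop_tags(tokens: Vec<Token>, tags: &HashSet<&str>) -> Vec<Token> {
    tokens
        .into_iter()
        .filter(|t| !tags.contains(t.tag.as_str()))
        .collect()
}
```

Applied to the tokens 東京 (名詞), に (助詞) and 行く (動詞) with the tag set {名詞}, `keep_tags` returns only 東京, whereas `stop_tags` returns に and 行く.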
Keep only tokens whose text matches one of the specified words.
Keep only tokens with the specified part-of-speech tag.
Replace the text of a token with the reading of the text as registered in the morphological dictionary.
Remove tokens with the specified part-of-speech tag.
Keep only tokens whose text has the specified number of characters.
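A length filter of this kind has to count characters rather than bytes, since Japanese and Korean text is multi-byte in UTF-8. The sketch below assumes optional minimum and maximum bounds; the function name and the exact shape of the parameters are hypothetical.

```rust
/// Keeps tokens whose character count lies within the optional bounds.
/// `None` means that side is unbounded.
fn keep_by_length<'a>(
    tokens: &[&'a str],
    min: Option<usize>,
    max: Option<usize>,
) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|t| {
            // chars().count() counts Unicode scalar values, not bytes.
            let n = t.chars().count();
            min.map_or(true, |m| n >= m) && max.map_or(true, |m| n <= m)
        })
        .collect()
}
```

With both bounds set to 2, the tokens "a", "東京", "すもも" and "to" are reduced to "東京" and "to".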
Normalizes token text to lowercase.
Replace characters with the specified character mappings.
Remove tokens whose text matches one of the specified words.
Normalizes token text to uppercase.
The API reference is available at the following URL: