| Crates.io | lindera-python |
| lib.rs | lindera-python |
| version | 1.1.0 |
| created_at | 2025-09-11 04:35:36.355997+00 |
| updated_at | 2025-09-14 15:27:18.637502+00 |
| description | Python binding for Lindera. |
| homepage | https://github.com/lindera/lindera-python |
| repository | https://github.com/lindera/lindera-python |
| max_upload_size | |
| id | 1833330 |
| size | 487,328 |
Python binding for Lindera, a Japanese morphological analysis engine.
lindera-python provides a comprehensive Python interface to the Lindera 1.1.1 morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis (a Korean sketch follows the list below). This implementation includes all major features:
- Character filters
- Token filters
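The same builder pattern should apply to the other supported languages. Below is a minimal sketch for Korean, assuming the Korean dictionary is compiled in and addressable by a URI analogous to the IPADIC one used in the examples that follow; the URI embedded://ko-dic is an assumption and depends on how the library was built:
from lindera import TokenizerBuilder
# Hypothetical Korean setup; "embedded://ko-dic" is an assumed URI that
# mirrors the "embedded://ipadic" form used for Japanese below.
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ko-dic")
tokenizer = builder.build()
for token in tokenizer.tokenize("안녕하세요"):
    print(token.text)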
# Install Python
% pyenv install 3.13.5
# Clone the lindera-python repository
% git clone git@github.com:lindera/lindera-python.git
% cd lindera-python
# Set Python version for this project
% pyenv local 3.13.5
# Create a Python virtual environment
% python -m venv .venv
# Activate Python virtual environment
% source .venv/bin/activate
# Initialize lindera-python project
(.venv) % make init
This command takes a long time because it builds a library that includes all the dictionaries.
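# Build and install lindera-python into the active virtual environment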
(.venv) % make develop
from lindera import TokenizerBuilder
# Create a tokenizer with default settings
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
tokenizer = builder.build()
# Tokenize Japanese text
text = "すもももももももものうち"
tokens = tokenizer.tokenize(text)
for token in tokens:
print(f"Text: {token.text}, Position: {token.position}")
from lindera import TokenizerBuilder
# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
# Build tokenizer with filters
tokenizer = builder.build()
text = "テストー123"
tokens = tokenizer.tokenize(text) # Will apply filters automatically
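With these filters in place, the mapping filter rewrites the long-vowel mark "ー" to "-" and NFKC normalization folds full-width characters to their ASCII forms before tokenization, so the tokens are produced from the filtered text rather than the original string. A quick way to inspect the result (output illustrative):
# Print each token produced from the filtered input.
for token in tokens:
    print(f"{token.text} ({token.position})")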
from lindera import TokenizerBuilder
# Create tokenizer builder
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {"tags": ["助詞", "助動詞"]})
# Build tokenizer with filters
tokenizer = builder.build()
tokens = tokenizer.tokenize("テキストの解析")
from lindera import TokenizerBuilder
# Build tokenizer with integrated filters
builder = TokenizerBuilder()
builder.set_mode("normal")
builder.set_dictionary("embedded://ipadic")
# Add character filters
builder.append_character_filter("mapping", {"mapping": {"ー": "-"}})
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
# Add token filters
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")
# Build and use
tokenizer = builder.build()
tokens = tokenizer.tokenize("コーヒーショップ")
from lindera import Metadata
# Get metadata for a specific dictionary
metadata = Metadata.load("embedded://ipadic")
print(f"Dictionary: {metadata.dictionary_name}")
print(f"Version: {metadata.dictionary_version}")
# Access schema information
schema = metadata.dictionary_schema
print(f"Schema has {len(schema.fields)} fields")
print(f"Fields: {schema.fields[:5]}") # First 5 fields
Character filters and token filters accept configuration as dictionary arguments:
from lindera import TokenizerBuilder
builder = TokenizerBuilder()
builder.set_dictionary("embedded://ipadic")
# Character filters with dict configuration
builder.append_character_filter("unicode_normalize", {"kind": "nfkc"})
builder.append_character_filter("japanese_iteration_mark", {
"normalize_kanji": "true",
"normalize_kana": "true"
})
builder.append_character_filter("mapping", {
"mapping": {"リンデラ": "lindera", "トウキョウ": "東京"}
})
# Token filters with dict configuration
builder.append_token_filter("japanese_katakana_stem", {"min": 3})
builder.append_token_filter("length", {"min": 2, "max": 10})
builder.append_token_filter("japanese_stop_tags", {
"tags": ["助詞", "助動詞", "記号"]
})
# Filters without configuration can omit the dict
builder.append_token_filter("lowercase")
builder.append_token_filter("japanese_base_form")
tokenizer = builder.build()
See the examples/ directory for comprehensive examples, including:
- tokenize.py: Basic tokenization
- tokenize_with_filters.py: Using character and token filters
- tokenize_with_userdict.py: Custom user dictionary

The main API classes are:
- TokenizerBuilder: Fluent builder for tokenizer configuration
- Tokenizer: Main tokenization engine
- Token: Individual token with text, position, and linguistic features
- CharacterFilter: Text preprocessing filters
- TokenFilter: Token post-processing filters
- Metadata: Dictionary metadata and configuration
- Schema: Dictionary schema definition

See the test_basic.py file for comprehensive API usage examples.