| Crates.io | matcher_py |
| lib.rs | matcher_py |
| version | 0.5.8 |
| created_at | 2024-06-15 12:05:19.270087+00 |
| updated_at | 2025-08-23 00:36:52.564451+00 |
| description | A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust. |
| homepage | https://github.com/Lips7/Matcher |
| repository | https://github.com/Lips7/Matcher |
| max_upload_size | |
| id | 1272768 |
| size | 93,515 |
A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
For detailed implementation, see the Design Document.
Text variations:
  Fanjian: 蟲艸 -> 虫艹
  Delete: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  Normalize: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
  PinYin: 西安 -> xi an, matches 洗按 -> xi an, but not 先 -> xian
  PinYinChar: 西安 -> xian, matches 洗按 and 先 -> xian
Logical combinations (& and ~):
  hello&world matches hello world and world,hello
  无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
  hello~helloo~hhello matches hello but not helloo and hhello
Install with pip:
  pip install matcher_py
Visit the release page to download the pre-built binary.
All relevant types are defined in extension_types.py.
Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id; for each match_id, every table_id inside must be unique.
SimpleMatcher's configuration is defined by the SimpleTable = Dict[ProcessType, Dict[int, str]] type. The key of the inner Dict[int, str] is called word_id, and word_id must be globally unique.
MatchTable fields:
  table_id: The unique ID of the match table.
  match_table_type: The type of the match table.
  word_list: The word list of the match table.
  exemption_process_type: The type of the exemption simple match.
  exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
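The exemption rule and the table_id uniqueness requirement can be illustrated with a minimal pure-Python sketch. It does not use matcher_py: table_matches and check_unique_table_ids are hypothetical helpers, plain substring checks stand in for the real matching engine, and the tables are ordinary dicts that only mirror the field names listed above.

```python
from typing import Dict, List


def table_matches(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """Return True when some word matches and no exemption word matches."""
    # Assumption: substring search stands in for the library's matching.
    word_hit = any(word in text for word in word_list)
    exemption_hit = any(word in text for word in exemption_word_list)
    return word_hit and not exemption_hit


def check_unique_table_ids(match_table_map: Dict[int, List[dict]]) -> None:
    """Raise ValueError if a table_id repeats within one match_id."""
    for match_id, tables in match_table_map.items():
        seen = set()
        for table in tables:
            table_id = table["table_id"]
            if table_id in seen:
                raise ValueError(f"duplicate table_id {table_id} under match_id {match_id}")
            seen.add(table_id)


# Hypothetical configuration shaped like MatchTableMap, using plain dicts.
config = {
    1: [
        {"table_id": 1, "word_list": ["hello", "world"], "exemption_word_list": ["word"]},
        {"table_id": 2, "word_list": ["hallo"], "exemption_word_list": []},
    ]
}
check_unique_table_ids(config)

first = config[1][0]
print(table_matches("hello", first["word_list"], first["exemption_word_list"]))       # True
print(table_matches("hello word", first["word_list"], first["exemption_word_list"]))  # False
```

In the second call the exemption word "word" is present, so the table reports no match even though "hello" was found.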
MatchTableType:
  Simple: Supports simple multiple-pattern matching with text normalization defined by process_type.
    Patterns can combine words with & and ~. For example, hello&world&hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must occur twice.
  Regex: Supports regex pattern matching.
    SimilarChar: Supports similar character matching using regex.
      ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] matches helloworld!, hollowrd?, hi🌍~ and any other combination of the words separated by , in the list.
    Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
      ["h,e,l,l,o", "你,好"] matches hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
    Regex: Supports regex matching.
      ["h[aeiou]llo", "w[aeiou]rd"] matches hello, world, hillo, wurld and any other text that matches a regex in the list.
  Similar: Supports similar text matching based on distance and threshold.
    Levenshtein: Supports similar text matching based on Levenshtein distance.
ProcessType:
  None: No transformation.
  Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN.
    妳好 -> 你好
    現⾝ -> 现身
  Delete: Delete all punctuation, special characters and white spaces. Based on TEXT_DELETE and WHITE_SPACE.
    hello, world! -> helloworld
    《你∷好》 -> 你好
  Normalize: Normalize all English character variations and number variations to basic characters. Based on NORM and NUM_NORM.
    ℋЀ⒈㈠Õ -> he11o
    ⒈Ƨ㊂ -> 123
  PinYin: Convert all Unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
    你好 -> ni hao
    西安 -> xi an
  PinYinChar: Convert all Unicode Chinese characters to pinyin without boundaries. Based on PINYIN.
    你好 -> nihao
    西安 -> xian
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
Avoid combining PinYin and PinYinChar. PinYin is a more limited version of PinYinChar, and in some cases the result is ambiguous: xian can be treated as two words, xi and an, or as a single word, xian.
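The difference between pinyin with and without boundaries can be seen in a small pure-Python sketch. It does not call matcher_py: the five-entry PINYIN_SKETCH table and both helper functions are hypothetical, and a single space is assumed as the boundary marker, following the examples in this document.

```python
# Hypothetical five-character table; the library's PINYIN data is far larger.
PINYIN_SKETCH = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}


def pinyin_with_boundaries(text: str) -> str:
    """PinYin-style sketch: syllables stay separated by a space."""
    # Only characters present in the table are supported in this sketch.
    return " ".join(PINYIN_SKETCH[ch] for ch in text)


def pinyin_without_boundaries(text: str) -> str:
    """PinYinChar-style sketch: syllables are concatenated directly."""
    return "".join(PINYIN_SKETCH[ch] for ch in text)


print(pinyin_with_boundaries("西安"))     # xi an
print(pinyin_with_boundaries("先"))       # xian
print(pinyin_without_boundaries("西安"))  # xian

# With boundaries, 西安 and 洗按 agree but 先 stays distinct.
print(pinyin_with_boundaries("西安") == pinyin_with_boundaries("洗按"))      # True
print(pinyin_with_boundaries("西安") == pinyin_with_boundaries("先"))        # False
# Without boundaries, 西安 and 先 collapse to the same string.
print(pinyin_without_boundaries("西安") == pinyin_without_boundaries("先"))  # True
```

Because the boundary-free form erases the split between xi and an, mixing the two transformations on the same word list makes it unclear which reading a pattern is meant to match.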
Here’s an example of how to use the reduce_text_process and text_process functions:
from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import ProcessType
print(reduce_text_process(ProcessType.MatchDeleteNormalize, "hello, world!"))
print(text_process(ProcessType.MatchDelete, "hello, world!"))
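For readers without the package installed, the Delete step can be approximated in plain Python. This is only a sketch: delete_sketch is a hypothetical helper that keeps characters for which str.isalnum() is true, whereas the library relies on its own TEXT_DELETE and WHITE_SPACE tables, so results may differ on unusual input.

```python
def delete_sketch(text: str) -> str:
    """Keep letters and digits; drop punctuation, symbols and whitespace."""
    # Assumption: str.isalnum() approximates the library's deletion tables.
    return "".join(ch for ch in text if ch.isalnum())


print(delete_sketch("hello, world!"))       # helloworld
print(delete_sketch("《你∷好》"))             # 你好
print(delete_sketch("*Fu&*iii&^%%*&kkkk"))  # Fuiiikkkk
```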
Here’s an example of how to use the Matcher:
import json

from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, ProcessType, RegexMatchType, SimMatchType

matcher = Matcher(
    json.dumps({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(process_type=ProcessType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=["word"],
            ),
            MatchTable(
                table_id=2,
                match_table_type=MatchTableType.Regex(
                    process_type=ProcessType.MatchFanjianDeleteNormalize,
                    regex_match_type=RegexMatchType.Regex
                ),
                word_list=["h[aeiou]llo"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=[],
            )
        ],
        2: [
            MatchTable(
                table_id=3,
                match_table_type=MatchTableType.Similar(
                    process_type=ProcessType.MatchFanjianDeleteNormalize,
                    sim_match_type=SimMatchType.MatchLevenshtein,
                    threshold=0.5
                ),
                word_list=["halxo"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=[],
            )
        ]
    }).encode()
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("word")

# Perform process as a list
result = matcher.process("hello")
assert result == [{'match_id': 1,
                   'table_id': 2,
                   'word_id': 0,
                   'word': 'h[aeiou]llo',
                   'similarity': 1.0},
                  {'match_id': 1,
                   'table_id': 1,
                   'word_id': 0,
                   'word': 'hello',
                   'similarity': 1.0},
                  {'match_id': 2,
                   'table_id': 3,
                   'word_id': 0,
                   'word': 'halxo',
                   'similarity': 0.6}]

# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1] == [{'match_id': 1,
                                                   'table_id': 2,
                                                   'word_id': 0,
                                                   'word': 'h[aeiou]llo',
                                                   'similarity': 1.0},
                                                  {'match_id': 1,
                                                   'table_id': 1,
                                                   'word_id': 0,
                                                   'word': 'hello',
                                                   'similarity': 1.0},
                                                  {'match_id': 1,
                                                   'table_id': 1,
                                                   'word_id': 1,
                                                   'word': 'world',
                                                   'similarity': 1.0}]

# Perform word matching as a string
result = matcher.word_match_as_string("hello")
assert result == """{"2":[{"match_id":2,"table_id":3,"word_id":0,"word":"halxo","similarity":0.6}],"1":[{"match_id":1,"table_id":2,"word_id":0,"word":"h[aeiou]llo","similarity":1.0},{"match_id":1,"table_id":1,"word_id":0,"word":"hello","similarity":1.0}]}"""
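The similarity of 0.6 reported for halxo against hello is consistent with a normalized Levenshtein score. The sketch below is a plain-Python illustration, not the library's implementation: it assumes the score is 1 - distance / max(len(a), len(b)), and both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


distance = levenshtein("halxo", "hello")
score = similarity("halxo", "hello")
print(distance)          # 2
print(round(score, 2))   # 0.6
print(score >= 0.5)      # True
```

Two substitutions (a to e, x to l) over five characters give 0.6, which clears the 0.5 threshold configured for table 3.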
Here’s an example of how to use the SimpleMatcher:
import json

from matcher_py import SimpleMatcher
from matcher_py.extension_types import ProcessType

simple_matcher = SimpleMatcher(
    json.dumps(
        {
            ProcessType.MatchNone: {
                1: "hello&world",
                2: "word&word~hello"
            },
            ProcessType.MatchDelete: {
                3: "hallo"
            }
        }
    ).encode()
)

# Check if a text matches
assert simple_matcher.is_match("hello^&!#*#&!^#*()world")

# Perform simple processing
result = simple_matcher.process("hello,world,word,word,hallo")
assert result == [{'word_id': 1, 'word': 'hello&world'}, {'word_id': 3, 'word': 'hallo'}]
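The & and ~ semantics used in these patterns can be approximated with a short pure-Python sketch. It is an illustration under stated assumptions rather than the library's algorithm: parse_pattern and simple_match are hypothetical names, a word following ~ is assumed to veto the match on any occurrence, and repeat counts use non-overlapping substring counts via str.count.

```python
import re
from collections import Counter
from typing import Set, Tuple


def parse_pattern(pattern: str) -> Tuple[Counter, Set[str]]:
    """Split a pattern into required words (with counts) and forbidden words."""
    required: Counter = Counter()
    forbidden: Set[str] = set()
    operator = "&"  # the first word is treated as required
    for token in re.split(r"([&~])", pattern):
        if token in ("&", "~"):
            operator = token
        elif token:
            if operator == "&":
                required[token] += 1  # repeated words raise the required count
            else:
                forbidden.add(token)
    return required, forbidden


def simple_match(pattern: str, text: str) -> bool:
    """True if every required word occurs often enough and no forbidden word occurs."""
    required, forbidden = parse_pattern(pattern)
    if any(word in text for word in forbidden):
        return False
    return all(text.count(word) >= times for word, times in required.items())


print(simple_match("hello&world", "world,hello"))              # True
print(simple_match("hello&world&hello", "helloworld"))         # False
print(simple_match("无&法&无&天", "无无法天"))                   # True
print(simple_match("hello~helloo~hhello", "helloo"))           # False
print(simple_match("word&word~hello", "hello,world,word,word,hallo"))  # False
```

The last call mirrors word_id 2 in the example above: word occurs twice, but the presence of hello excludes the pattern, which is why only word_id 1 and 3 appear in the result.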
Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
matcher_py is licensed under the MIT OR Apache-2.0 license.
For more details, visit the GitHub repository.