Crates.io | matcher_py |
lib.rs | matcher_py |
version | |
source | src |
created_at | 2024-06-15 12:05:19.270087 |
updated_at | 2024-10-14 03:00:26.468443 |
description | A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust. |
homepage | https://github.com/Lips7/Matcher |
repository | https://github.com/Lips7/Matcher |
id | 1272768 |
A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
For detailed implementation, see the Design Document.
Fanjian: 蟲艸 -> 虫艹
Delete: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
Normalize: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
PinYin: 西安 -> xi an, which matches 洗按 -> xi an, but not 先 -> xian
PinYinChar: 西安 -> xian, which matches both 洗按 and 先 -> xian
hello&world matches hello world and world,hello
无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
hello~helloo~hhello matches hello, but not helloo or hhello
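The three combination examples above can be reproduced with a small pure-Python model. This is only a sketch, assuming that & means "all of these words, counted with repetition" and ~ means "none of these words", which is how the examples behave. The function name matches_combination is hypothetical and is not part of matcher_py, and plain substring counting stands in for the library's real matching engine.

```python
from collections import Counter


def matches_combination(pattern: str, text: str) -> bool:
    """Toy model of an '&' / '~' pattern; not the library's algorithm."""
    # Everything before the first '~' is required; the remaining parts are excluded words.
    required_part, *excluded = pattern.split("~")
    # Count how often each required word is listed, e.g. 无&法&无&天 needs 无 twice.
    required = Counter(required_part.split("&"))
    if any(text.count(word) < times for word, times in required.items()):
        return False
    # Any excluded word that appears in the text rejects the match.
    return not any(word in text for word in excluded)


print(matches_combination("hello&world", "world,hello"))
print(matches_combination("无&法&无&天", "无法天"))
print(matches_combination("hello~helloo~hhello", "helloo"))
```

Counting occurrences rather than checking mere presence is what makes 无&法&无&天 reject 无法天 in this model.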
pip install matcher_py
Visit the release page to download the pre-built binary.
All relevant types are defined in extension_types.py.
Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id, and within each match_id every table_id is required to be unique.
SimpleMatcher's configuration is defined by the SimpleTable = Dict[ProcessType, Dict[int, str]] type. The key of the inner Dict[int, str] is called word_id, and word_id is required to be globally unique.
Each MatchTable has the following fields:
table_id: The unique ID of the match table.
match_table_type: The type of the match table.
word_list: The word list of the match table.
exemption_process_type: The type of the exemption simple match.
exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
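The exemption rule can be written as a single boolean expression. The sketch below is an illustration only: table_matches is a hypothetical helper, and plain substring checks stand in for whatever matching the table type and process types would really perform.

```python
from typing import Sequence


def table_matches(
    text: str,
    word_list: Sequence[str],
    exemption_word_list: Sequence[str],
) -> bool:
    """Hypothetical helper: substring checks stand in for the real matcher."""
    word_hit = any(word in text for word in word_list)
    exemption_hit = any(word in text for word in exemption_word_list)
    # An exemption hit overrides a word hit.
    return word_hit and not exemption_hit


print(table_matches("hello world", ["hello", "world"], ["word"]))
print(table_matches("hello word", ["hello", "world"], ["word"]))
```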
MatchTableType variants:
Simple: Supports simple multiple patterns matching with text normalization defined by process_type.
  Patterns can be combined with & and ~. For example, hello&world&hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
Regex: Supports regex patterns matching.
  SimilarChar: Supports similar character matching using regex.
    ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld!, hollowrd?, hi🌍~ and any other combination of the words split by , in the list.
  Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。
  Regex: Supports regex matching.
    ["h[aeiou]llo", "w[aeiou]rld"] will match hello, world, hillo, wurld and any other text that matches a regex in the list.
Similar: Supports similar text matching based on distance and threshold.
  Levenshtein: Supports similar text matching based on Levenshtein distance.
ProcessType variants:
None: No transformation.
Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN.
  妳好 -> 你好
  現⾝ -> 现身
Delete: Delete all punctuation, special characters and white spaces. Based on TEXT_DELETE and WHITE_SPACE.
  hello, world! -> helloworld
  《你∷好》 -> 你好
Normalize: Normalize all English character variations and number variations to basic characters. Based on NORM and NUM_NORM.
  ℋЀ⒈㈠Õ -> he11o
  ⒈Ƨ㊂ -> 123
PinYin: Convert all Unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
  你好 -> ni hao
  西安 -> xi an
PinYinChar: Convert all Unicode Chinese characters to pinyin without boundaries. Based on PINYIN.
  你好 -> nihao
  西安 -> xian
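To get a feel for what the Delete and Normalize steps do, they can be roughly approximated with Python's standard unicodedata module. This is a sketch under loud assumptions: the library uses its own tables (TEXT_DELETE, WHITE_SPACE, NORM, NUM_NORM), so Unicode categories and NFKC folding only approximate them and will not reproduce every example above. The function names are hypothetical.

```python
import unicodedata


def delete_step(text: str) -> str:
    """Drop punctuation (P*), symbols (S*), separators (Z*) and other whitespace."""
    return "".join(
        ch
        for ch in text
        if unicodedata.category(ch)[0] not in "PSZ" and not ch.isspace()
    )


def normalize_step(text: str) -> str:
    """Fold compatibility forms such as ℋ to H with NFKC, then lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


def delete_normalize(text: str) -> str:
    """Chain the two approximations, in the spirit of a combined process type."""
    return normalize_step(delete_step(text))


print(delete_step("hello, world!"))
print(delete_step("《你∷好》"))
print(delete_normalize("ℋello, World!"))
```

Chaining small steps like this is one way to picture a combined transformation; how the library orders and caches the steps internally is not shown here.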
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar: in some cases, such as xian, the text can be treated either as two words (xi and an) or as a single word (xian).
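The xian ambiguity is easy to see with a toy conversion table. The sketch below is not the library's implementation: TOY_PINYIN is a hypothetical five-character table (the library ships a full PINYIN mapping), and wrapping each syllable in spaces is an assumption about how "with boundaries" is represented.

```python
# Hypothetical five-character table, for illustration only.
TOY_PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}


def pinyin_with_boundaries(text: str) -> str:
    """PinYin-style: each syllable is wrapped in spaces, so syllable edges survive."""
    return "".join(f" {TOY_PINYIN[ch]} " if ch in TOY_PINYIN else ch for ch in text)


def pinyin_without_boundaries(text: str) -> str:
    """PinYinChar-style: syllables are concatenated, so syllable edges are lost."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


# With boundaries, 西安 and 洗按 agree, while 先 stays distinct.
print(repr(pinyin_with_boundaries("西安")), repr(pinyin_with_boundaries("先")))
# Without boundaries, 西安 and 先 become indistinguishable.
print(pinyin_without_boundaries("西安") == pinyin_without_boundaries("先"))
```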
Here’s an example of how to use the reduce_text_process and text_process functions:
from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import ProcessType
print(reduce_text_process(ProcessType.MatchDeleteNormalize, "hello, world!"))
print(text_process(ProcessType.MatchDelete, "hello, world!"))
Here’s an example of how to use the Matcher:
import json

from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, ProcessType, RegexMatchType, SimMatchType

matcher = Matcher(
    json.dumps({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(process_type=ProcessType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=["word"],
            ),
            MatchTable(
                table_id=2,
                match_table_type=MatchTableType.Regex(
                    process_type=ProcessType.MatchFanjianDeleteNormalize,
                    regex_match_type=RegexMatchType.Regex
                ),
                word_list=["h[aeiou]llo"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=[],
            )
        ],
        2: [
            MatchTable(
                table_id=3,
                match_table_type=MatchTableType.Similar(
                    process_type=ProcessType.MatchFanjianDeleteNormalize,
                    sim_match_type=SimMatchType.MatchLevenshtein,
                    threshold=0.5
                ),
                word_list=["halxo"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=[],
            )
        ]
    }).encode()
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("word")

# Perform process as a list
result = matcher.process("hello")
assert result == [
    {'match_id': 1, 'table_id': 2, 'word_id': 0, 'word': 'h[aeiou]llo', 'similarity': 1.0},
    {'match_id': 1, 'table_id': 1, 'word_id': 0, 'word': 'hello', 'similarity': 1.0},
    {'match_id': 2, 'table_id': 3, 'word_id': 0, 'word': 'halxo', 'similarity': 0.6},
]

# Perform word matching as a dict
assert matcher.word_match(r"hello, world")[1] == [
    {'match_id': 1, 'table_id': 2, 'word_id': 0, 'word': 'h[aeiou]llo', 'similarity': 1.0},
    {'match_id': 1, 'table_id': 1, 'word_id': 0, 'word': 'hello', 'similarity': 1.0},
    {'match_id': 1, 'table_id': 1, 'word_id': 1, 'word': 'world', 'similarity': 1.0},
]

# Perform word matching as a string
result = matcher.word_match_as_string("hello")
assert result == """{"2":[{"match_id":2,"table_id":3,"word_id":0,"word":"halxo","similarity":0.6}],"1":[{"match_id":1,"table_id":2,"word_id":0,"word":"h[aeiou]llo","similarity":1.0},{"match_id":1,"table_id":1,"word_id":0,"word":"hello","similarity":1.0}]}"""
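The similarity of 0.6 reported for halxo in the example above is consistent with a normalized Levenshtein similarity, that is, one minus the edit distance divided by the longer length. The sketch below is a plain-Python illustration under that assumption. The exact normalization used by the library is not stated in this document, and both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, kept to two rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# halxo and hello differ by two substitutions out of five characters.
print(levenshtein("halxo", "hello"), similarity("halxo", "hello"))
```

Under this reading, the configured threshold of 0.5 is cleared, which agrees with table 3 reporting a match for hello.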
Here’s an example of how to use the SimpleMatcher:
import json

from matcher_py import SimpleMatcher
from matcher_py.extension_types import ProcessType

simple_matcher = SimpleMatcher(
    json.dumps(
        {
            ProcessType.MatchNone: {
                1: "hello&world",
                2: "word&word~hello"
            },
            ProcessType.MatchDelete: {
                3: "hallo"
            }
        }
    ).encode()  # call encode() so that bytes are passed, not the bound method
)
# Check if a text matches
assert simple_matcher.is_match("hello^&!#*#&!^#*()world")
# Perform simple processing
result = simple_matcher.process("hello,world,word,word,hallo")
assert result == [{'word_id': 1, 'word': 'hello&world'}, {'word_id': 3, 'word': 'hallo'}]
Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
matcher_py is licensed under the MIT OR Apache-2.0 license.
For more details, visit the GitHub repository.