| Crates.io | regex-tokenizer |
| lib.rs | regex-tokenizer |
| version | 0.1.1 |
| created_at | 2023-03-22 18:59:01.889224+00 |
| updated_at | 2023-03-22 19:08:48.423678+00 |
| description | A regex tokenizer |
| homepage | https://github.com/cmargiotta/regex-tokenizer |
| repository | https://github.com/cmargiotta/regex-tokenizer |
| max_upload_size | |
| id | 817401 |
| size | 11,517 |
A regex-based tokenizer with a minimal DSL to define it!
```rust
tokenizer! {
    SimpleTokenizer
    r"[a-zA-Z]\w*" => Identifier
    r"\d+" => Number
    r"\s+" => _
}
```
And, in a function:

```rust
// ...
let tokenizer = SimpleTokenizer::new();
// ...
```
The macro generates an enum named `SimpleTokenizer_types`, containing `Identifier` and `Number`. Regexes with `_` as their class are ignored; when a substring matches none of the specified regexes, the tokenization fails. When multiple non-ignored regexes match the input, priority goes to the one defined first.

Calling `tokenizer.tokenize(...)` returns an iterator that extracts tokens from the query.
A token is formed by:

```rust
{
    value: String,
    position: usize,
    type_: SimpleTokenizer_types,
}
```
`position` is the index of the token's first character within the query. A call to `.next()` returns `None` when there are no more tokens to extract.
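To make the semantics above concrete, here is a minimal, self-contained sketch of the same behavior in plain Rust, with hand-rolled matchers standing in for the regexes (the real crate expands the `tokenizer!` DSL instead; the names `Token`, `TokenType`, and `tokenize` here are illustrative, not the crate's API). It shows declaration-order priority, ignored (`_`) rules, and failure when no rule matches.

```rust
// Sketch of the tokenizer semantics described above, using only std.
// Each matcher returns the length (in bytes) of the match at the start
// of its input, or 0 if it does not match there.

#[derive(Debug, PartialEq, Clone, Copy)]
enum TokenType {
    Identifier,
    Number,
}

#[derive(Debug, PartialEq)]
struct Token {
    value: String,
    position: usize,
    type_: TokenType,
}

// Stand-in for r"[a-zA-Z]\w*" (ASCII only, for simplicity).
fn match_identifier(s: &str) -> usize {
    let b = s.as_bytes();
    if b.is_empty() || !b[0].is_ascii_alphabetic() {
        return 0;
    }
    let mut n = 1;
    while n < b.len() && (b[n].is_ascii_alphanumeric() || b[n] == b'_') {
        n += 1;
    }
    n
}

// Stand-in for r"\d+".
fn match_number(s: &str) -> usize {
    s.bytes().take_while(|b| b.is_ascii_digit()).count()
}

// Stand-in for r"\s+".
fn match_whitespace(s: &str) -> usize {
    s.bytes().take_while(|b| b.is_ascii_whitespace()).count()
}

// Returns None if some part of the input matches no rule (failed tokenization).
fn tokenize(input: &str) -> Option<Vec<Token>> {
    // (matcher, token type); a None type marks an ignored rule, like `_` in the DSL.
    let rules: [(fn(&str) -> usize, Option<TokenType>); 3] = [
        (match_identifier, Some(TokenType::Identifier)),
        (match_number, Some(TokenType::Number)),
        (match_whitespace, None),
    ];
    let mut pos = 0;
    let mut out = Vec::new();
    while pos < input.len() {
        let rest = &input[pos..];
        // Declaration-order priority: the first rule that matches wins.
        let (len, ty) = rules.iter().find_map(|(m, ty)| {
            let n = m(rest);
            if n > 0 { Some((n, *ty)) } else { None }
        })?; // no rule matched here: the whole tokenization fails
        if let Some(type_) = ty {
            out.push(Token {
                value: rest[..len].to_string(),
                position: pos,
                type_,
            });
        }
        pos += len;
    }
    Some(out)
}

fn main() {
    // Whitespace is matched but ignored; positions index into the query.
    let tokens = tokenize("foo 42 bar").unwrap();
    for t in &tokens {
        println!("{:?}", t);
    }
}
```

Note how the ignored rule still consumes input (so `position` keeps advancing) but produces no token, and how `?` aborts the whole pass on the first unmatched substring, matching the "tokenization is considered failed" behavior described above.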