| Crates.io | lexxor |
| lib.rs | lexxor |
| version | 0.9.2 |
| created_at | 2025-05-16 23:29:29.922068+00 |
| updated_at | 2025-11-14 00:21:08.85482+00 |
| description | A fast, extensible, greedy, single-pass text tokenizer for Rust |
| homepage | |
| repository | https://github.com/JeffThomas/lexx |
| max_upload_size | |
| id | 1677519 |
| size | 2,146,527 |
A fast, extensible, greedy, single-pass text tokenizer implemented in Rust. Lexxor is designed for high-performance tokenization with minimal memory allocations, making it suitable for parsing large files or real-time text processing.
Lexxor is a tokenizer library that lets you define and compose token-matching strategies. It processes input character by character, identifying the longest possible match at each position using a set of configurable matchers, and includes a precedence mechanism for resolving conflicts between matchers. For example, given the input `3.14`, an IntegerMatcher can match `3`, but a FloatMatcher matches the longer `3.14`, so greedy matching produces a single float token.
Lexxor consists of four main components: the `Lexxor` tokenizer itself, matchers implementing the `Matcher` trait, input sources implementing the `LexxorInput` trait, and the `Token` values it produces.
Lexxor provides several built-in matchers for common token types:
- `WordMatcher`: Matches alphabetic words
- `IntegerMatcher`: Matches integer numbers
- `FloatMatcher`: Matches floating-point numbers
- `SymbolMatcher`: Matches non-alphanumeric symbols
- `WhitespaceMatcher`: Matches whitespace characters (spaces, tabs, newlines)
- `KeywordMatcher`: Matches specific keywords (but not as substrings)
- `ExactMatcher`: Matches exact string patterns (operators, delimiters, etc.)

Matchers can be assigned precedence values to resolve conflicts when multiple matchers could match the same input. This allows for sophisticated tokenization strategies, such as recognizing keywords as distinct from regular words, as sketched below.
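For instance, the keyword `if` also satisfies the generic `WordMatcher`; giving a keyword matcher the higher precedence resolves the tie in its favor. The sketch below assumes a `KeywordMatcher` construction analogous to the other matchers; its actual fields may differ, so consult the crate documentation:

```rust
use lexxor::matcher::keyword::KeywordMatcher; // module path assumed, mirroring the other matchers
use lexxor::matcher::word::WordMatcher;
use lexxor::matcher::Matcher;

// Build a matcher set in which keywords outrank plain words.
fn keyword_aware_matchers() -> Vec<Box<dyn Matcher>> {
    vec![
        // Precedence 0: matches any alphabetic run, including "if".
        Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
        // Precedence 1: wins the tie when a complete keyword matches.
        // The field layout here is hypothetical -- check the crate docs
        // for the real KeywordMatcher constructor.
        Box::new(KeywordMatcher { keywords: vec!["if".to_string()], index: 0, precedence: 1, running: true }),
    ]
}
```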
Here is a basic example that tokenizes a sentence using the standard matchers:

```rust
use lexxor::Lexxor;
use lexxor::input::InputString;
use lexxor::matcher::word::WordMatcher;
use lexxor::matcher::whitespace::WhitespaceMatcher;
use lexxor::matcher::symbol::SymbolMatcher;
use lexxor::matcher::integer::IntegerMatcher;
use lexxor::matcher::float::FloatMatcher;

fn main() {
    // Create a simple input string
    let input_text = "Hello world! This is 42 and 3.14159.";
    let input = InputString::new(input_text.to_string());

    // Create a Lexxor tokenizer with standard matchers
    let lexx = Lexxor::<512>::new(
        Box::new(input),
        vec![
            Box::new(WhitespaceMatcher { index: 0, column: 0, line: 0, precedence: 0, running: true }),
            Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
            Box::new(IntegerMatcher { index: 0, precedence: 0, running: true }),
            Box::new(FloatMatcher { index: 0, precedence: 0, dot: false, float: false, running: true }),
            Box::new(SymbolMatcher { index: 0, precedence: 0, running: true }),
        ],
    );

    // Process tokens using the Iterator interface
    for token in lexx {
        println!("{}", token);
    }
}
```
You can create custom matchers by implementing the `Matcher` trait:
```rust
use lexxor::matcher::{Matcher, MatcherResult};
use lexxor::token::{Token, TOKEN_TYPE_CUSTOM};
use std::collections::HashMap;
use std::fmt::Debug;

// Define a custom token type (custom types conventionally start at 100+)
const TOKEN_TYPE_HEX_COLOR: u16 = 200;

#[derive(Debug)]
struct HexColorMatcher {
    index: usize,
    precedence: u8,
    running: bool,
}

impl Matcher for HexColorMatcher {
    fn reset(&mut self, _ctx: &mut Box<HashMap<String, i32>>) {
        self.index = 0;
        self.running = true;
    }

    fn find_match(
        &mut self,
        oc: Option<char>,
        value: &[char],
        _ctx: &mut Box<HashMap<String, i32>>,
    ) -> MatcherResult {
        // Implementation for matching hex color codes such as "#ff0000"
        // is elided here; a real matcher would inspect `oc` and `value`
        // and report progress through the returned MatcherResult.
        todo!()
    }

    fn is_running(&self) -> bool {
        self.running
    }

    fn precedence(&self) -> u8 {
        self.precedence
    }
}
```
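Once defined, a custom matcher is registered exactly like the built-ins. Here is a minimal sketch, assuming the `find_match` body above has been completed: the higher precedence value lets `HexColorMatcher` claim `#ff0000` in situations where `SymbolMatcher` or `WordMatcher` could otherwise participate in the match.

```rust
use lexxor::Lexxor;
use lexxor::input::InputString;
use lexxor::matcher::symbol::SymbolMatcher;
use lexxor::matcher::whitespace::WhitespaceMatcher;
use lexxor::matcher::word::WordMatcher;

fn main() {
    let input = InputString::new("color: #ff0000;".to_string());
    let lexx = Lexxor::<512>::new(
        Box::new(input),
        vec![
            // The custom matcher sits alongside the built-ins; its higher
            // precedence resolves overlapping matches in its favor.
            Box::new(HexColorMatcher { index: 0, precedence: 1, running: true }),
            Box::new(WhitespaceMatcher { index: 0, column: 0, line: 0, precedence: 0, running: true }),
            Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
            Box::new(SymbolMatcher { index: 0, precedence: 0, running: true }),
        ],
    );
    for token in lexx {
        println!("{}", token);
    }
}
```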
Lexxor is optimized for high-performance tokenization:
| Benchmark | Time |
|---|---|
| Small file (15 bytes) | ~1.2 µs |
| UTF-8 sample (13 KB) | ~350 µs |
| Large file (1.8 MB) | ~45 ms |
These benchmarks were measured on standard hardware. Your results may vary depending on your system specifications.
The tokenizer's buffer capacity is fixed at compile time through a const generic parameter: `Lexxor<CAP>`, where `CAP` is the maximum token size (the quick-start example uses `Lexxor::<512>`).

Add Lexxor to your `Cargo.toml`:
```toml
[dependencies]
lexxor = "0.9.2"
```
Lexxor defines several standard token types:
- `TOKEN_TYPE_WHITESPACE` (3): Whitespace characters
- `TOKEN_TYPE_WORD` (4): Word tokens (alphabetic characters)
- `TOKEN_TYPE_INTEGER` (1): Integer numbers
- `TOKEN_TYPE_FLOAT` (2): Floating-point numbers
- `TOKEN_TYPE_SYMBOL` (5): Symbol characters
- `TOKEN_TYPE_EXACT` (6): Exact string matches
- `TOKEN_TYPE_KEYWORD` (7): Reserved keywords

You can define custom token types starting from higher numbers (e.g., 100+) for your application-specific needs.
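Dispatching on a token's type typically looks like the sketch below. The constants are the ones listed above, but the `token_type` field name is an assumption made for illustration; check the `Token` documentation for the actual accessor.

```rust
use lexxor::token::{Token, TOKEN_TYPE_FLOAT, TOKEN_TYPE_INTEGER, TOKEN_TYPE_WORD};

// `token.token_type` is a hypothetical field name used for illustration only.
fn describe(token: &Token) -> &'static str {
    match token.token_type {
        TOKEN_TYPE_WORD => "word",
        TOKEN_TYPE_INTEGER | TOKEN_TYPE_FLOAT => "number",
        _ => "other", // whitespace, symbols, keywords, custom types...
    }
}
```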
Lexxor supports multiple input sources through the `LexxorInput` trait:

- `InputString`: Tokenize from a `String`
- `InputReader`: Tokenize from any source implementing `Read`

You can implement custom input sources by implementing the `LexxorInput` trait.
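To tokenize a file, wrap it in an `InputReader`. The constructor shown here is an assumption (a reader wrapping any `std::io::Read` source); verify the actual signature against the crate documentation.

```rust
use std::fs::File;
use lexxor::Lexxor;
use lexxor::input::InputReader;
use lexxor::matcher::symbol::SymbolMatcher;
use lexxor::matcher::whitespace::WhitespaceMatcher;
use lexxor::matcher::word::WordMatcher;

fn main() -> std::io::Result<()> {
    let file = File::open("input.txt")?;
    // Assumed constructor: InputReader wrapping a std::io::Read source.
    let input = InputReader::new(file);
    let lexx = Lexxor::<512>::new(
        Box::new(input),
        vec![
            Box::new(WhitespaceMatcher { index: 0, column: 0, line: 0, precedence: 0, running: true }),
            Box::new(WordMatcher { index: 0, precedence: 0, running: true }),
            Box::new(SymbolMatcher { index: 0, precedence: 0, running: true }),
        ],
    );
    for token in lexx {
        println!("{}", token);
    }
    Ok(())
}
```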
Lexxor returns `LexxError` in two cases:

- `TokenNotFound`: No matcher could match the current input
- `Error`: Some other error occurred during tokenization

To successfully parse an entire input, ensure your matchers cover every possible character sequence; for example, input containing spaces or punctuation will fail with `TokenNotFound` unless a `WhitespaceMatcher` and a `SymbolMatcher` (or equivalents) are registered.
MIT License
Contributions are welcome! Please feel free to submit a Pull Request.