Crates.io | blex |
lib.rs | blex |
version | 0.2.2 |
source | src |
created_at | 2023-03-11 23:32:31.432237 |
updated_at | 2023-03-24 23:11:50.180209 |
description | A lightweight lexing framework |
homepage | https://github.com/mondobe/blex |
repository | https://github.com/mondobe/blex |
max_upload_size | |
id | 807606 |
size | 27,275 |
Blex is a lightweight lexing framework built in Rust, based on the T-Lex framework. Blex is built around the concept of a set of rules that process a set of tokens. Each rule is run across the input successively, gradually transforming it into an output string of tokens.
The implementation of tokens itself is the idea most heavily borrowed from Rustuck. Tokens in Blex are represented as a contiguous range of characters from the original string along with a set of tags, themselves also represented as borrowed strings. For example, the string `let x: int = 2;` might be represented as:
"let": var_dec, keyword
"x": ident
":": colon, keyword
"int": ident
"=": eq, keyword
"2": int
Notice the lack of whitespace tokens: the output tokens don't have to cover the entire input string.
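To make that shape concrete, here is a rough sketch of the idea. This is not blex's actual `Token` definition; the `TokenSketch` name and fields are made up for this illustration of "a range into the source plus a list of borrowed tags":

```rust
use std::ops::Range;

// NOT blex's real Token type; an illustrative sketch of the representation
// described above: a slice of the original string plus a list of borrowed tags.
struct TokenSketch<'a> {
    source: &'a str,     // the original input string
    range: Range<usize>, // which part of the source this token covers
    tags: Vec<&'a str>,  // tags such as "keyword" or "ident"
}

impl<'a> TokenSketch<'a> {
    // The text this token covers.
    fn content(&self) -> &str {
        &self.source[self.range.clone()]
    }

    // Tag lookup, in the spirit of blex's has_tag method.
    fn has_tag(&self, tag: &str) -> bool {
        self.tags.iter().any(|&t| t == tag)
    }
}
```

Under this sketch, the `"let"` token above would cover range `0..3` of the source and carry the tags `["var_dec", "keyword"]`.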
Rules in Blex are any function that transforms a vector of tokens into an optional vector of tokens. In Rust, that refers to this trait bound: `Fn(Vec<Token>) -> Option<Vec<Token>>`. Rules may modify the input tokens without worrying about mutating the original string: tokens are cloned before the rules are applied to them (luckily, cloning is a very cheap operation for tokens, which are just a range and a few pointers).
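As a minimal example of something satisfying that bound (using blex's `Token` type), here is a rule that accepts every group of tokens unchanged; it isn't useful on its own, but it type-checks against the bound above:

```rust
// A do-nothing rule: it satisfies Fn(Vec<Token>) -> Option<Vec<Token>>
// by returning its input untouched.
fn identity_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    Some(tokens)
}
```

A closure such as `|tokens| Some(tokens)` satisfies the same bound.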
Rules are processed with the `process_rule` and `process_rules` functions. The `process_rule` function applies the rule starting on each token in a list. Rules are applied starting with a single token, which is wrapped in a `Vec`, passed into the function, and processed. If the function returns `None`, the starting token and the next token are combined into a `Vec` and passed into the function. If the function returns `Some`, the tokens passed in are replaced with the returned `Vec`.
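In other words, each rule sees a window of tokens that grows until the rule either rewrites the window or runs out of input. The sketch below illustrates that loop using plain `String`s in place of blex's `Token` type; it is not blex's implementation, just a picture of the behavior described above:

```rust
// Illustration only: a growing-window driver in the spirit of process_rule,
// operating on Strings rather than blex Tokens.
fn process_rule_sketch<F>(rule: F, body: &mut Vec<String>)
where
    F: Fn(Vec<String>) -> Option<Vec<String>>,
{
    let mut start = 0;
    while start < body.len() {
        let mut end = start + 1;
        loop {
            // Hand the current window (initially a single token) to the rule.
            match rule(body[start..end].to_vec()) {
                Some(replacement) => {
                    // The rule answered: swap the window for its output.
                    let _removed: Vec<String> =
                        body.splice(start..end, replacement).collect();
                    break;
                }
                // The rule wants to see one more token, if there is one.
                None if end < body.len() => end += 1,
                // Nothing left to offer; move on to the next starting token.
                None => break,
            }
        }
        start += 1;
    }
}
```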
Rules are processed on strings of tokens, but strings of characters are transformed into tokens using the `str_to_tokens` function. Each character of a string is turned into a corresponding token, and one tag is added with the content of the character, a holdover from Rustuck.
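For instance, given the behavior described above, tokenizing a two-character string should yield one token per character, each tagged with its own content:

```rust
let tokens = str_to_tokens("ab");
// Each character becomes a token tagged with its own content.
assert!(tokens[0].has_tag("a"));
assert!(tokens[1].has_tag("b"));
```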
Let's create a rule that takes two consecutive tokens that read `ab` and converts them into one token with the tag `c`. The rule would start out like this:
```rust
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
}
```
Let's say we input the string `a b blex ab abab`. The string will be turned into these tokens:
"a": a;
" ": ;
"b": b;
" ": ;
"b": b;
"l": l;
"e": e;
"x": x;
" ": ;
"a": a;
"b": b;
" ": ;
"a": a;
"b": b;
"a": a;
"b": b;
"":
Notice the empty token at the end. We didn't type that: it was added automatically to give some rules, like those testing for words, a buffer. Our rule will start by scanning each token individually. Remember that we are scanning for the pattern "ab". Let's chart out each possible route our rule could take.
1. If the token has the tag `a`, we should continue the rule to include the next token.
2. If it doesn't, the token can never start an `ab`, so we should leave it alone.

When the rule continues, we know the first token has the tag `a`. Knowing that, there are two paths we can take:

3. If the next token has the tag `b`, we should combine those tokens and give the result the tag `c`.
4. If it doesn't, we should leave the tokens alone.

Luckily, blex has idiomatic ways to express this. We'll start by `match`ing on `tokens_structure(&tokens)`, which has two options:
```rust
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
        },
        TokenStructure::Multiple => {
        }
    }
}
```
The `tokens_structure` function takes in a borrowed `Vec` of tokens. If it finds that the `Vec` holds only one token, it returns that token wrapped in a `Single`. Otherwise, it returns `Multiple`. This is a safe way to guarantee the existence of one token.
There is also an idiomatic way to test a token for certain tags: the `has_tag` method. We can use it in both arms of the match to test for the `a` and `b` tags respectively.
```rust
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
            if tok.has_tag("a") {
                // case 1
            } else {
                // case 2
            }
        },
        TokenStructure::Multiple => {
            if tokens[1].has_tag("b") {
                // case 3
            } else {
                // case 4
            }
        }
    }
}
```
The only thing left is the return values:

- In case 1, we want to continue the rule to the next token, so we return `None`.
- In cases 2 and 4, the pattern can't match, so we leave the tokens as they are and return `Some(tokens)`.
- In case 3, we want to combine the two tokens into one with the tag `c`. Of course, there is an idiomatic way to do this as well: `wrap`. The `wrap` function takes a `Vec` of tokens (which are assumed to be from the same string of tokens) and combines all of their contents into one token. The second argument of `wrap` is a `Vec<&str>`, which contains all of the tags to add to the new token.

```rust
fn ab_rule(tokens: Vec<Token>) -> Option<Vec<Token>> {
    match tokens_structure(&tokens) {
        TokenStructure::Single(tok) => {
            if tok.has_tag("a") {
                None
            } else {
                Some(tokens)
            }
        },
        TokenStructure::Multiple => {
            if tokens[1].has_tag("b") {
                Some(vec![wrap(tokens, vec!["c"])])
            } else {
                Some(tokens)
            }
        }
    }
}
```
That completes our rule! We can test it with a script like this:
```rust
#[test]
fn apply_ab() {
    let text = "a b blex ab abab";
    let mut body = str_to_tokens(text);
    process_rule(ab_rule, &mut body);
    print_tokens(body);
}
```
Which gives us:
"a": a;
" ": ;
"b": b;
" ": ;
"b": b;
"l": l;
"e": e;
"x": x;
" ": ;
"ab": c;
" ": ;
"ab": c;
"ab": c;
"":