Crates.io | lexr |
lib.rs | lexr |
version | 0.1.0 |
source | src |
created_at | 2023-11-19 16:06:04.833384 |
updated_at | 2023-11-19 16:06:04.833384 |
description | Flexible, powerful and simple lexing in Rust |
homepage | |
repository | https://github.com/JENebel/lexr-parsr.git |
max_upload_size | |
id | 1041279 |
size | 34,810 |
Lexr is a simple and flexible lexing library for Rust. It is designed to be used on its own, or in conjunction with parsr.
The interface consists of a single macro, `lex_rule!`, which is used to define a lexing rule. The macro generates a function that can be called to produce a lexer. The lexer is an iterator over the input string, producing tokens and locations as it goes.
If you encounter any issues or have suggestions, please report them on the GitHub repository.
Here is a simple example of a lexer that recognizes the tokens `A`, `B`, and `C`:
```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, B, C,
}
use Token::*;

lex_rule!{lex -> Token {
    "a" => |_| A,
    "b" => |_| B,
    "c" => |_| C,
}}

let tokens = lex("abc").into_token_vec();
assert_eq!(tokens, vec![A, B, C]);
```
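Conceptually, the generated function behaves like a hand-written loop that tries each pattern against the start of the remaining input and takes the first match. Here is a minimal std-only sketch of that idea — an illustration of the matching strategy, not lexr's actual implementation:

```rust
#[derive(Debug, PartialEq)]
enum Token { A, B, C }

// Sketch of what a generated lexer conceptually does: try each pattern
// against the beginning of the remaining input, first match wins.
fn lex(input: &str) -> Vec<Token> {
    let mut rest = input;
    let mut tokens = Vec::new();
    while !rest.is_empty() {
        // Patterns are tried in the order they were defined.
        let (tok, len) = if rest.starts_with('a') {
            (Token::A, 1)
        } else if rest.starts_with('b') {
            (Token::B, 1)
        } else if rest.starts_with('c') {
            (Token::C, 1)
        } else {
            panic!("unexpected character");
        };
        tokens.push(tok);
        rest = &rest[len..]; // advance past the matched text
    }
    tokens
}

fn main() {
    assert_eq!(lex("abc"), vec![Token::A, Token::B, Token::C]);
}
```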
The `lex_rule!` macro is used to define a lexer. A lex rule has a name, a token type, and any number of patterns with associated actions. The syntax is as follows:
```rust
lex_rule!{NAME(ARGS) -> TOKEN {
    PATTERN => ACTION,
    ...
}}
```
- `NAME` is the name of the function that is generated by the macro. This function can be called to produce a lexer.
- `TOKEN` is the type of the tokens that the lexer produces. This can be any type, including the unit type `()`.
- The rules consist of a pattern and an action that produces a token, or `continue`s or `break`s.

The order of the patterns is important, as the first one that matches is chosen. Patterns are matched against the beginning of the input in the order they are defined.
Patterns can be the following:

- A string literal, a constant, or a sequence of these, which is matched as a regex against the input.
- `_`, which matches any single character. This does not match eof.
- `eof`, which matches the end of the input. This is optional; if it is not provided, the end of the input is simply ignored.
- `ws`, which matches any whitespace character.

Here is an example showing the different legal patterns:
```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, B, C, D, Num, Eof
}
use Token::*;

const A_RULE: &str = "a";

lex_rule!{lex -> Token {
    ws => |_| continue,   // Matches whitespace
    "a" => |_| A,         // Matches "a"
    "b" "a" => |_| B,     // Matches "ba"
    "c" A_RULE => |_| C,  // Matches "ca"
    r"[0-9]+" => |_| Num, // Matches any number of digits
    _ => |_| D,           // Matches any single character
    eof => |_| Eof,       // Matches the end of the input
}}

let tokens = lex("a ba ca S 42").into_token_vec();
assert_eq!(tokens, vec![A, B, C, D, Num, Eof]);
```
An action is a closure returning the token type provided in the macro definition. It runs when its pattern matches, and can be used to produce a token, skip input, or stop lexing altogether.

There are 3 different signatures for the closure, which provide different parameters to the action:

- `|s|` - The action is provided with the matched string.
- `|s, buf|` - The action is provided with the matched string and a buffer. The buffer can be used to lex a sub rule.
- `|s, buf, loc|` - The action is provided with the matched string, a buffer, and a location. The location is the location of the matched string in the input.

Only the first argument is required; the rest are optional. They can all be ignored with an underscore `_`. This means that if no arguments are needed, the signature can be written as `|_|`. For instance, if only the location is of interest, the other arguments can be ignored with underscores: `|_, _, loc|`.
The actions themselves can be any expression that returns a token, or that `continue`s or `break`s. `continue` and `break` work as follows:

- `continue` - Skips the current token and returns the next token instead.
- `break` - Stops the lexer; the iterator will return `None` when this is encountered.

Notably, it is possible to call sub rules from the action.

Here is an example showing the different legal actions:
```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, Num(i32), Eof
}
use Token::*;

lex_rule!{lex -> Token {
    // Returns A
    "a" => |_| A,
    // Matches any whitespace and skips it
    r"[ \n\t\r]" => |_| continue,
    // Stops the lexer
    "x" => |_| break,
    // Calls the sub rule and runs it until it is done
    "#" => |_, buf| { comment(buf).deplete(); continue },
    // Parses the number and returns it
    r"[0-9]+" => |s| Num(s.parse().unwrap()),
    // Detects and returns Eof
    eof => |_| Eof,
}}

// A simple rule that ignores all characters until a '#' is encountered
lex_rule!{comment -> () {
    "#" => |_| break,
    _ => |_| continue,
}}

let tokens = lex("a # comment # 42 a").into_token_vec();
assert_eq!(tokens, vec![A, Num(42), A, Eof]);

let tokens = lex("aa 12 x aa").into_token_vec();
assert_eq!(tokens, vec![A, A, Num(12)]);
```
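The `continue`/`break` semantics can also be pictured in terms of a plain iterator. The following std-only sketch (illustrative, not lexr's API) shows an iterator that skips some matches and permanently stops on another, mirroring how a lexr-generated lexer behaves:

```rust
// A hand-rolled "lexer" over characters: spaces are skipped (like
// `|_| continue`) and 'x' stops the stream for good (like `|_| break`).
fn lex(input: &str) -> impl Iterator<Item = char> + '_ {
    let mut rest = input.chars();
    let mut stopped = false;
    std::iter::from_fn(move || {
        if stopped {
            return None; // after a `break`, the iterator only yields None
        }
        loop {
            match rest.next() {
                Some(' ') => continue, // skip and try the next token instead
                Some('x') => {
                    stopped = true; // stop the lexer altogether
                    return None;
                }
                Some(c) => return Some(c), // an ordinary token
                None => return None,       // end of input
            }
        }
    })
}

fn main() {
    // 'x' ends the stream, so 'c' is never produced.
    let tokens: String = lex("a b x c").collect();
    assert_eq!(tokens, "ab");
}
```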
A lex rule can take arguments, which are passed to the generated lexer function. These can be used, for instance, to pass context information, or to forward arguments to sub rules.

Here is an example showing how to pass an argument to a lexer:
```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, B(i32), Eof
}
use Token::*;

lex_rule!{lex(arg: i32) -> Token {
    "a" => |_| A,
    "b" => |_| B(arg),
    eof => |_| Eof,
}}

let tokens = lex("ab", 12).into_token_vec();
assert_eq!(tokens, vec![A, B(12), Eof]);
```
Sub rules are lex rules that are called from the action of another lex rule. Such a call operates on the same buffer, so the sub rule advances it. This can be used to lex, for instance, comments, or even entire sub languages.

Be aware that calling sub rules is not tail recursive, so use it with caution, and not as the main way to lex. Also make sure that the sub rule is actually run, otherwise nothing happens. This can be done by calling `deplete` to run it to the end, or `next` for a single token.

Here is an example showing how to call a sub rule:
```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, Eof
}
use Token::*;

lex_rule!{lex -> Token {
    ws => |_| continue,
    "a" => |_| A,
    r"\(\*" => |_, buf| { comment(buf, 0).next(); continue },
    eof => |_| Eof,
}}

lex_rule!{comment(depth: u16) -> () {
    r"\(\*" => |_, buf| { comment(buf, depth + 1).next(); break },
    r"\*\)" => |_, buf|
        if depth == 0 {
            break
        } else {
            comment(buf, depth - 1).next();
            break
        },
    eof => |_| panic!("Unclosed comment!"),
    _ => |_| continue,
}}

let tokens = lex("a (* comment (* inner *) comment *) aa").into_token_vec();
assert_eq!(tokens, vec![A, A, A, Eof]);
```
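The shared-buffer mechanism can be pictured with two plain functions that advance the same cursor into the input, so the main rule resumes exactly where the sub rule stopped. This is a std-only sketch of the idea (the names and signatures are illustrative, not lexr's):

```rust
// Sub rule: consume everything up to and including the closing '#',
// advancing the caller's cursor as it goes.
fn comment(rest: &mut &str) {
    while let Some(c) = rest.chars().next() {
        let s = *rest; // &str is Copy; keep the original lifetime
        *rest = &s[c.len_utf8()..];
        if c == '#' {
            return; // closing delimiter found; hand control back
        }
    }
    panic!("Unclosed comment!");
}

// Main rule: on '#', delegate to the sub rule on the same buffer.
fn lex(mut rest: &str) -> Vec<char> {
    let mut tokens = Vec::new();
    while let Some(c) = rest.chars().next() {
        rest = &rest[c.len_utf8()..];
        match c {
            '#' => comment(&mut rest), // sub rule advances the shared cursor
            ' ' => continue,           // skip whitespace
            other => tokens.push(other),
        }
    }
    tokens
}

fn main() {
    // The comment body is consumed by the sub rule and never tokenized.
    assert_eq!(lex("a # skipped # b"), vec!['a', 'b']);
}
```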
License: MIT