# lexr

Lexr is a simple and flexible lexing library for Rust. It is designed to be used on its own, or in conjunction with [parsr](https://github.com/JENebel/lexr-parsr/tree/master/parsr).

The syntax consists of a single macro, [`lex_rule!`](crate::lex_rule!), which is used to define a lexing rule. The macro generates a function that can be called to produce a lexer. The lexer is an iterator over the input string, producing tokens and locations as it goes.

If you encounter any issues or have suggestions, please report them [here](https://github.com/JENebel/lexr-parsr/issues).

Here is a simple example of a lexer that recognizes the tokens `A`, `B`, and `C`:

```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, B, C,
}
use Token::*;

lex_rule!{lex -> Token {
    "a" => |_| A,
    "b" => |_| B,
    "c" => |_| C,
}}

let tokens = lex("abc").into_token_vec();
assert_eq!(tokens, vec![A, B, C]);
```

## Macro Syntax

The `lex_rule!` macro is used to define a lexer. A lex rule has a name, a token type, and any number of patterns with associated actions. The syntax is as follows:

```rust
lex_rule!{NAME(ARGS) -> TOKEN {
    PATTERN => ACTION,
    ...
}}
```

- `NAME` is the name of the function that is generated by the macro. This function can be called to produce a lexer.
- [ARGS](#args) is an optional list of arguments that are passed to the lexer.
- `TOKEN` is the type of the tokens that the lexer produces. This can be any type, including the unit type `()`.
- [PATTERN](#patterns) is a pattern that the lexer matches against the input. If the pattern matches, the action is executed.
- [ACTION](#actions) is an expression that is executed if the pattern matches. The expression must produce a token, or `continue`, or `break`.

Each rule consists of a pattern and an action resulting in a token.\
The order of the patterns is important, as the first one that matches is chosen.

### Patterns

Patterns are matched against the beginning of the input in the order they are defined.

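To make the ordering rule concrete, here is a minimal plain-Rust sketch of "first matching pattern wins". It does not use lexr and is not how lexr is implemented: the `Tok` type and the `first_match` and `lex_all` helpers are hypothetical names made up for this illustration, and the patterns are plain string prefixes rather than regexes.

```rust
/// Hypothetical token type for this sketch (not part of lexr).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tok {
    Assign,
    Eq,
}

/// Tries each (prefix, token) rule in order against the start of `input`
/// and returns the first hit together with the number of bytes consumed.
fn first_match(rules: &[(&str, Tok)], input: &str) -> Option<(Tok, usize)> {
    rules
        .iter()
        .find(|(pat, _)| input.starts_with(*pat))
        .map(|(pat, tok)| (*tok, pat.len()))
}

/// Applies `first_match` repeatedly and stops at the first position
/// where no rule matches (including the end of the input).
fn lex_all(rules: &[(&str, Tok)], mut input: &str) -> Vec<Tok> {
    let mut out = Vec::new();
    while let Some((tok, len)) = first_match(rules, input) {
        out.push(tok);
        input = &input[len..];
    }
    out
}

fn main() {
    // "=" is listed first, so it shadows "==" and the input splits in two.
    let short_first = [("=", Tok::Assign), ("==", Tok::Eq)];
    assert_eq!(lex_all(&short_first, "=="), vec![Tok::Assign, Tok::Assign]);

    // Listing the longer pattern first yields a single Eq token.
    let long_first = [("==", Tok::Eq), ("=", Tok::Assign)];
    assert_eq!(lex_all(&long_first, "=="), vec![Tok::Eq]);
}
```

The same consideration applies when writing a `lex_rule!`: when one pattern is a prefix of another, the more specific pattern usually needs to be listed first.
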
Patterns can be the following:

- One or more string slice literals or constants. These strings are concatenated and used for regex matching.
- A wildcard `_` that matches any single character. This does not match eof.
- `eof`, which matches the end of the input. This is optional; if it is not provided, end of file is simply ignored.
- `ws`, which matches any whitespace character.

Here is an example showing the different legal patterns:

```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, B, C, D, Num, Eof
}
use Token::*;

const A_RULE: &str = "a";

lex_rule!{lex -> Token {
    ws => |_| continue,     // Matches whitespace
    "a" => |_| A,           // Matches "a"
    "b" "a" => |_| B,       // Matches "ba"
    "c" A_RULE => |_| C,    // Matches "ca"
    r"[0-9]+" => |_| Num,   // Matches one or more digits
    _ => |_| D,             // Matches any single character
    eof => |_| Eof,         // Matches the end of the input
}}

let tokens = lex("a ba ca S 42").into_token_vec();
assert_eq!(tokens, vec![A, B, C, D, Num, Eof]);
```

### Actions

An action is a closure returning the token type provided in the macro definition.\
It runs when the pattern matches, and can be used to produce a token, skip the match, or stop lexing altogether.

#### Signature

There are 3 different signatures for the closure, which provide different parameters to the action:

- `|s|` - The action is provided with the matched string.
- `|s, buf|` - The action is provided with the matched string and a buffer. The buffer can be used to lex a sub rule.
- `|s, buf, loc|` - The action is provided with the matched string, a buffer, and a location. The location is the location of the matched string in the input.

Only the first argument is required; the rest are optional. Any of them can be ignored with an underscore `_`.\
This means that if no arguments are needed, the signature can be written as `|_|`.\
For instance, if only the location is of interest, the other arguments can be ignored: `|_, _, loc|`.

#### Action

The actions themselves can be any expression that returns a token, or that evaluates to `continue` or `break`.

`continue` and `break` work as follows:

- `continue` - Skips the current match and returns the next token instead.
- `break` - Stops the lexer, so the iterator returns `None` when this is encountered.

Notably, it is possible to call [sub rules](#sub-rules) from the action.

Here is an example showing the different legal actions:

```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, Num(i32), Eof
}
use Token::*;

lex_rule!{lex -> Token {
    // Returns A
    "a" => |_| A,
    // Matches any whitespace and skips it
    r"[ \n\t\r]" => |_| continue,
    // Stops the lexer
    "x" => |_| break,
    // Calls the sub rule and runs it until it is done
    "#" => |_, buf| { comment(buf).deplete(); continue },
    // Parses the number and returns it
    r"[0-9]+" => |s| Num(s.parse().unwrap()),
    // Detects the end of the input and returns Eof
    eof => |_| Eof,
}}

// A simple rule that ignores all characters until a '#' is encountered
lex_rule!{comment -> () {
    "#" => |_| break,
    _ => |_| continue,
}}

let tokens = lex("a # comment # 42 a").into_token_vec();
assert_eq!(tokens, vec![A, Num(42), A, Eof]);

let tokens = lex("aa 12 x aa").into_token_vec();
assert_eq!(tokens, vec![A, A, Num(12)]);
```

## Args

Arguments declared in the rule are passed to the generated lexer function after the input. They can be used, for instance, to pass context information, or to pass values on to sub rules.

Here is an example showing how to pass an argument to a lexer:

```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, B(i32), Eof
}
use Token::*;

lex_rule!{lex(arg: i32) -> Token {
    "a" => |_| A,
    "b" => |_| B(arg),
    eof => |_| Eof,
}}

let tokens = lex("ab", 12).into_token_vec();
assert_eq!(tokens, vec![A, B(12), Eof]);
```

## Sub Rules

Sub rules are lex rules that are called from the action of another lex rule.\
The call operates on the same buffer, so the sub rule mutates the buffer.\
This can be used to lex, for instance, comments, or even entire sub languages.

Be aware that calling sub rules is not tail recursive, so use it with caution, and not as the main way to lex.

Also make sure that the sub rule is actually run, otherwise nothing happens. This can be done by calling `deplete` to run it to the end, or `next` for a single token.

Here is an example showing how to call a sub rule:

```rust
use lexr::lex_rule;

#[derive(Debug, PartialEq)]
enum Token {
    A, Eof
}
use Token::*;

lex_rule!{lex -> Token {
    ws => |_| continue,
    "a" => |_| A,
    // On "(*", run the comment rule on the same buffer, then skip to the next token
    r"\(\*" => |_, buf| { comment(buf, 0).next(); continue },
    eof => |_| Eof,
}}

lex_rule!{comment(depth: u16) -> () {
    // A nested "(*" continues lexing one level deeper
    r"\(\*" => |_, buf| { comment(buf, depth + 1).next(); break },
    // "*)" ends the comment at depth 0, otherwise continues one level up
    r"\*\)" => |_, buf| if depth == 0 { break } else { comment(buf, depth - 1).next(); break },
    eof => |_| panic!("Unclosed comment!"),
    _ => |_| continue,
}}

let tokens = lex("a (* comment (* inner *) comment *) aa").into_token_vec();
assert_eq!(tokens, vec![A, A, A, Eof]);
```

License: MIT