# rust-bison-skeleton

A set of Bison skeleton files that can be used to generate a Bison grammar that is written in Rust.

Technically it's more like a Bison frontend for Rust.

## Requirements

+ Rust
+ Bison `3.7.3` or higher (possibly a slightly lower version works too, it's untested; better get the latest version)

The `bison` executable must be available in `$PATH`.

## Short explanation

Bison is a parser generator, and in fact it doesn't really care which programming language you use. Under the hood it takes your `.y` file, parses it, extracts all derivations and then constructs a bunch of tables. Then this data is passed to a template called a `skeleton`. You can simply treat it as a JSX/ERB/Handlebars-style view template. The skeleton is a special file written in the M4 language (which is not really a programming language; it's closer to a macro engine) that (once rendered) prints your `.rs` file. As simple as that.

## Configuration

Just like for the C/C++/Java/D skeletons, the following directives can be configured:

+ `%expect N` where `N` is the number of expected conflicts. Better set it to 0.
+ `%define api.parser.struct {Parser}` where `Parser` is the name of your parser struct. Optional, `Parser` is the default name.
+ `%define api.value.type {Value}` where `Value` is the name of the derivation result (and stack item) struct. Optional, `Value` is the default name.
+ `%code use { }` allows you to specify a block of code that will be placed at the top of the file. Can be a multi-line block; optional, has no default value.
+ `%code parser_fields { }` allows you to specify additional custom fields for your `Parser` struct. Can be a multi-line block; optional, has no default value.
+ `%define api.parser.check_debug { /* expr */ }` allows you to configure printing of debug information. `self` is an instance of your parser, so use something like this if you want to turn it into a configurable field:

```bison
%code parser_fields {
    debug: bool
}
%define api.parser.check_debug { self.debug }
```

All other directives that are available in Bison can be configured too; read the official Bison docs.

## Basic usage

This skeleton generates an LALR(1) parser, and so the parser has a stack. This stack is represented as a `Vec<Value>` where `Value` is an enum that **must be defined by you**. The name of this enum must be set using the `%define api.value.type {}` directive.

Let's build a simple calculator that handles lines like `1 + (4 - 3) * 2`.

First, let's define a boilerplate in `src/parser.y`:

```bison
%expect 0

%define api.parser.struct {Parser}
%define api.value.type {Value}

%define parse.error custom
%define parse.trace

%code use {
    // all use goes here
    use crate::Loc;
}

%code parser_fields {
    // custom parser fields
}

%token
    tPLUS   "+"
    tMINUS  "-"
    tMUL    "*"
    tDIV    "/"
    tLPAREN "("
    tRPAREN ")"
    tNUM    "number"

%left "-" "+"
%left "*" "/"

%%

// rules

%%

impl Parser {
    // parser implementation
}

enum Value {
    // variants to define
}
```

Currently this grammar has no rules, but it's a good start. This code (once compiled) defines a `Parser` struct at the top of the file that looks like this:

```rust
#[derive(Debug)]
pub struct Parser {
    pub yylexer: Lexer,
    yy_error_verbose: bool,
    yynerrs: i32,
    yyerrstatus_: i32,
    /* "%code parser_fields" blocks. */
}
```

Keep in mind that `Parser` auto-implements `std::fmt::Debug`, so all custom fields also have to implement it.

The `Value` enum is what is returned by derivations and what's stored in the stack of the parser.
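Note that the boilerplate imports `crate::Loc`, which is a struct you define yourself. A minimal sketch could look like this, assuming only the `begin`/`end` fields that the lexer example later in this README relies on (the `size` helper is just a hypothetical convenience, not required by the skeleton):

```rust
/// Minimal source-location struct; `begin` and `end` are the fields
/// that the `Token` and `Lexer` examples access.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Loc {
    pub begin: usize,
    pub end: usize,
}

impl Loc {
    /// Length of the span in characters (convenience helper, not required)
    pub fn size(&self) -> usize {
        self.end - self.begin
    }
}
```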
This enum must be defined by you and **it has to** have the following variants:

+ `Uninitialized` - a variant that is stored in `$$` by default (and that you overwrite)
+ `Stolen` - a variant that a stack value is replaced with when you take it from the stack by writing `$<Variant>1`
+ `Token(TokenStruct)` - a variant that is used when a shift is performed; it holds your `TokenStruct` that is returned by the lexer

Additionally, you can have as many variants as you want; they must represent what you return from derivation rules. In our case we want the variants `Number` (to represent a numeric expression) and `None` (this one is actually required to represent the return value of the top-level rule).

```rust
#[derive(Clone, Debug)]
pub enum Value {
    None,
    Uninitialized,
    Stolen,
    Token(Token),
    Number(i32),
}

impl Default for Value {
    fn default() -> Self {
        Self::Stolen
    }
}
```

It must implement `Clone`, `Debug` and `Default` (a `.take()` that swaps `&mut Value` with `Value::default()` is used under the hood, so `default()` **must** return the `Stolen` variant).

The skeleton also defines a `Lexer` struct with a bunch of constants representing token numbers; it looks like this:

```rust
// AUTO-GENERATED
impl Lexer {
    /* Token kinds. */

    // Token "end of file", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const YYEOF: i32 = 0;

    // Token error, to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const YYerror: i32 = 256;

    // Token "invalid token", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const YYUNDEF: i32 = 257;

    // Token "+", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const tPLUS: i32 = 258;

    // Token "-", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const tMINUS: i32 = 259;

    // Token "*", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const tMUL: i32 = 260;

    // Token "/", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const tDIV: i32 = 261;

    // Token "(", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const tLPAREN: i32 = 262;

    // Token ")", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const tRPAREN: i32 = 263;

    // Token "number", to be returned by the scanner.
    #[allow(non_upper_case_globals, dead_code)]
    pub const tNUM: i32 = 264;
}
```

Thus, we can define our lexer logic:

```rust
use crate::{Loc, Value};

/// A token that is emitted by the lexer and consumed by the parser
#[derive(Clone)]
pub struct Token {
    // Required field, used by the skeleton
    pub token_type: i32,
    // Optional field, used by our custom parser
    pub token_value: i32,
    // Required field, used by the skeleton
    pub loc: Loc,
}

/// `Debug` implementation
impl std::fmt::Debug for Token {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_> /*' fix quotes */) -> std::fmt::Result {
        f.write_str(&format!(
            "[{}, {:?}, {}...{}]",
            token_name(self.token_type),
            self.token_value,
            self.loc.begin,
            self.loc.end
        ))
    }
}

impl Token {
    /// Used by the parser to "unwrap" the `Value::Token` variant into a plain `Token` value
    pub(crate) fn from(value: Value) -> Token {
        match value {
            Value::Token(v) => v,
            other => panic!("expected Token, got {:?}", other),
        }
    }
}

#[allow(non_upper_case_globals)]
impl Lexer {
    pub fn new(src: &str) -> Self {
        let mut tokens = vec![];

        for (idx, c) in src.chars().enumerate() {
            let (token_type, token_value) = match c {
                '0' => (Self::tNUM, 0),
                '1' => (Self::tNUM, 1),
                '2' => (Self::tNUM, 2),
                '3' => (Self::tNUM, 3),
                '4' => (Self::tNUM, 4),
                '5' => (Self::tNUM, 5),
                '6' => (Self::tNUM, 6),
                '7' => (Self::tNUM, 7),
                '8' => (Self::tNUM, 8),
                '9' => (Self::tNUM, 9),
                '+' => (Self::tPLUS, -1),
                '-' => (Self::tMINUS, -1),
                '*' => (Self::tMUL, -1),
                '/' => (Self::tDIV, -1),
                '(' => (Self::tLPAREN, -1),
                ')' => (Self::tRPAREN, -1),
                ' ' => continue,
                _ => panic!("unknown char {}", c),
            };

            let token = Token {
                token_type,
                token_value,
                loc: Loc {
                    begin: idx,
                    end: idx + 1,
                },
            };
            tokens.push(token)
        }

        tokens.push(Token {
            token_type: Self::YYEOF,
            token_value: 0,
            loc: Loc {
                begin: src.len(),
                end: src.len() + 1,
            },
        });

        Self { tokens }
    }

    pub(crate) fn yylex(&mut self) -> Token {
        self.tokens.remove(0)
    }
}
```

This lexer is not buffered and it does unnecessary work in case of a syntax error, but let's use it as it's easier to understand.

Now let's define the `Parser` <-> `Lexer` composition:

```rust
impl Parser {
    pub fn new(lexer: Lexer) -> Self {
        Self {
            yy_error_verbose: true,
            yynerrs: 0,
            yyerrstatus_: 0,
            yylexer: lexer,
        }
    }

    fn next_token(&mut self) -> Token {
        self.yylexer.yylex()
    }

    fn report_syntax_error(&self, ctx: &Context) {
        eprintln!("syntax error: {:#?}", ctx)
    }
}
```

`Parser` encapsulates `Lexer` and calls it in a `next_token` method that is invoked by the skeleton.

Time to define the rules:

```bison
%type expr number program

%%

program: expr
            {
                self.result = Some($1);
                $$ = Value::None;
            }
       | error
            {
                self.result = None;
                $$ = Value::None;
            }

   expr: number
            {
                $$ = $1;
            }
       | tLPAREN expr tRPAREN
            {
                $$ = $2;
            }
       | expr tPLUS expr
            {
                $$ = Value::Number($<Number>1 + $<Number>3);
            }
       | expr tMINUS expr
            {
                $$ = Value::Number($<Number>1 - $<Number>3);
            }
       | expr tMUL expr
            {
                $$ = Value::Number($<Number>1 * $<Number>3);
            }
       | expr tDIV expr
            {
                $$ = Value::Number($<Number>1 / $<Number>3);
            }

 number: tNUM
            {
                $$ = Value::Number($<Token>1.token_value);
            }

%%
```

As you can see, our grammar has the following rules:

```bison
program: expr
       | error

   expr: number
       | '(' expr ')'
       | expr '+' expr
       | expr '-' expr
       | expr '*' expr
       | expr '/' expr

 number: [0-9]
```

`$$` is the return value and it has type `Value`. You can use `$1`, `$2`, etc. to get items 1, 2, etc. that are **not** unwrapped, i.e. that also have type `Value`.
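The `Stolen`/`Default` contract described earlier can be seen in isolation. Here is a self-contained sketch (with a trimmed-down `Value` enum) of roughly what happens when the generated code reads a value off the stack:

```rust
// Trimmed-down `Value` demonstrating why `default()` must return `Stolen`:
// the skeleton reads a stack slot via `.take()`, i.e. it swaps the slot
// with `Value::default()` and hands you the old contents.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    #[allow(dead_code)]
    Uninitialized,
    Stolen,
    Number(i32),
}

impl Default for Value {
    fn default() -> Self {
        Self::Stolen
    }
}

// Roughly what reading `$<Number>1` does internally: steal the slot's
// contents, leaving `Stolen` behind as an "already consumed" marker.
fn steal(slot: &mut Value) -> Value {
    std::mem::take(slot)
}
```

This is why forgetting to return `Stolen` from `default()` silently corrupts the stack: whatever variant you return instead becomes the leftover marker.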
To unwrap it you can use `$<Variant>1`, but then you must have the following method:

```rust
impl Variant {
    fn from(value: Value) -> Self {
        match value {
            Value::Variant(out) => out,
            other => panic!("wrong type, expected Variant, got {:?}", other),
        }
    }
}
```

In our case we want to have only one such variant - `Number`:

```rust
use crate::Value;

#[allow(non_snake_case)]
pub(crate) mod Number {
    use super::Value;

    pub(crate) fn from(value: Value) -> i32 {
        match value {
            Value::Number(out) => out,
            other => panic!("wrong type, expected Number, got {:?}", other),
        }
    }
}
```

Yes, it's a mod, but that's absolutely OK. It doesn't matter what `Variant` is; it's all about calling `Variant::from(Value)`.

Also, as you might have noticed, there's a `self.result = ...` assignment in the top-level rule `program`. It's required because there's no other way to get the value that is left on the stack once parsing is done: the stack is not a part of the parser's state. This is why we also need to declare the field:

```bison
%code parser_fields {
    result: Option<Value>,
}

// And Parser's constructor must initialize it
fn new(lexer: Lexer) -> Self {
    Self {
        result: None,
        // ...
    }
}
```

Now we need a `build.rs` script:

```rust
extern crate rust_bison_skeleton;
use rust_bison_skeleton::{process_bison_file, BisonErr};
use std::path::Path;

fn main() {
    match process_bison_file(&Path::new("src/parser.y")) {
        Ok(_) => {}
        Err(BisonErr { message, .. }) => {
            eprintln!("Bison error:\n{}\nexiting with 1", message);
            std::process::exit(1);
        }
    }
}
```

And so, after running `cargo build` we should get `src/parser.rs` with all auto-generated and manually written code combined into a single file.

You can find a full example in `tests/src/calc.y`.
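For the `build.rs` above to compile, the crate has to be declared as a build dependency in `Cargo.toml`. A sketch (the hyphenated package name is inferred from the `extern crate rust_bison_skeleton;` line; check crates.io for the current version instead of the wildcard):

```toml
[build-dependencies]
rust-bison-skeleton = "*" # pin a real version in practice
```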
## Error recovery

This skeleton fully matches the behavior of the other built-in Bison skeletons:

+ If you want to return an error from a derivation you can either:
    + do `return Ok(Self::YYERROR);`
    + or just `Err(())?`
+ If you want to completely abort execution you can:
    + `return Ok(Self::YYACCEPT);` to abort with a success-like status code
    + `return Ok(Self::YYABORT);` to abort with an error-like status code

Once an error is returned, a special `error` rule can catch and "swallow" it:

```bison
numbers: number
            {
                $$ = Value::NumbersList(vec![ $<Number>1 ]);
            }
       | numbers number
            {
                $$ = Value::NumbersList( $1.append($2) );
            }
       | error number
            {
                // ignore $1 and process only $2
                $$ = Value::NumbersList(vec![ $<Number>2 ]);
            }

 number: tNUM
            {
                $$ = $1
            }
       | tINVALID_NUM
            {
                return Ok(Self::YYERROR);
            }
```

Information about the error is automatically passed to `Parser::report_syntax_error`. The `Context` that it takes has methods `token()` and `location()`, so an implementation of this method can look like this:

```rust
fn report_syntax_error(&mut self, ctx: &Context) {
    let token_id: usize = ctx.token().code().try_into().unwrap();
    let token_name: &'static str = Lexer::TOKEN_NAMES[token_id];
    let error_loc: &Loc = ctx.location();

    eprintln!("Unexpected token {} at {:?}", token_name, error_loc);
}
```

## Generic parser

To make `Parser` generic you need to configure the following directive:

```bison
%define api.parser.generic {<T>}
```

This code is added to `struct Parser` and `impl Parser`:

```rust
struct Parser<T> {
    // ...
}

impl<T> Parser<T> {
    // ...
}
```

If you want to specify lifetimes, make sure to fix quotes with comments:

```bison
%define api.parser.generic {<'a /* 'fix quotes */, T>}
```

## Performance

You can find a `perf` example that runs a `Parser` thousands of times and creates a flamegraph.

```sh
$ TIMES=20000 cargo run --bin perf --release
```
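As a closing aside, the `Lexer::TOKEN_NAMES` lookup used in `report_syntax_error` above can be sketched standalone. The table contents here are hypothetical and abbreviated; the point is only that the table is indexed by the small sequential code returned by `ctx.token().code()`:

```rust
use std::convert::TryInto;

// Hypothetical, abbreviated stand-in for the generated `Lexer::TOKEN_NAMES`
// table, indexed by token code.
const TOKEN_NAMES: &[&str] = &["end of file", "error", "invalid token", "+"];

fn token_name(code: i32) -> &'static str {
    let id: usize = code.try_into().expect("token code must be non-negative");
    TOKEN_NAMES[id]
}
```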