# Panfix Parsing Panfix parsing is a new approach to parsing based on (but more expressive than) [operator precedence grammars](https://en.wikipedia.org/wiki/Operator-precedence_grammar). It is not based on Context Free Grammars (CFGs), nor on Parsing Expression Grammars (PEGs). It's has a **different style of grammar**, roughly a list of multifix operators. The rules for what constitute a valid grammar are extremely simple (unlike, say, LR or LALR parsing). Panfix parsing always runs in **linear time**, with no dependence on the size of the grammar. That is, if `N` is the size of the text to be parsed and `G` is the size of the grammar, then a panfix parser runs in `O(N)` (not `O(NG)` like Packrat PEG parsing). It gives very **lax** parsers, which will happily generate a parse tree in the presense of a variety of errors. On one hand, this increases the number of cases you need to deal with when processing a parse tree. On the other hand, these extra cases are error cases and this is a good time to produce custom error messages for them. ## Example: JSON The best introduction might be a worked example. Let's look at what it takes to parse JSON with panfix parsing. Here's a panfix grammar for JSON: ```rust use panfix::{Parser, Grammar, GrammarError}; fn make_json_parser() -> Result { let mut grammar = Grammar::new("[ \n\r\t]+")?; grammar.regex("String", r#""([^\\"]|(\\.))*""#)?; grammar.regex("Number", r#"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?"#)?; grammar.regex("Invalid", r#"[a-zA-Z_][a-zA-Z0-9_]*"#)?; // for catching missing qutoes grammar.string("Null", "null")?; grammar.string("True", "true")?; grammar.string("False", "false")?; grammar.op("Array", pattern!("[" "]"))?; grammar.op("Object", pattern!("{" "}"))?; grammar.right_assoc(); grammar.op("Keyval", pattern!(_ ":" _))?; grammar.right_assoc(); grammar.op("Comma", pattern!(_ "," _))?; grammar.finish() } ``` The `regex` and `string` lines are defining how to parse constants like strings and booleans. `regex` means "match this regex pattern` and `string` means "match this exact string". The "Array" line says that `[` starts an array and `]` ends it. The "Object" line is similar. Next up is: ```rust grammar.right_assoc(); grammar.op("Keyval", pattern!(_ ":" _))?; grammar.right_assoc(); grammar.op("Comma", pattern!(_ "," _))?; ``` The two `right\_assoc()` calls begin two _precedence groups_, each with one operator. Precedence groups have two effects. First, operators in groups defined earlier bind tighter than operators defined in later groups. For example, you would define `&&` in an earlier precedence group than `>`, so that `x < y && y < z` parses as `(x < y) && (y < z)` instead of `x < (y && y) < z`. Second, each precedence group declares whether its operators are left or right associative. For example, field access `_._` should be declared right associative so that `person.birthday.month` is parsed as `(person.birthday).month` instead of `person.(birthday.month)`. In the JSON example, we declare `:` and `,` to be right associative, but in this case left associative would work just as well. The `pattern!(_ "," _)` declares that `,` is an operator that takes a left and right argument, denoted by an underscore at the beginning and end. In contrast, `pattern!("[" "]")` lacks underscores and thus says that JSON arrays do not take a left or right argument. On the other hand, an argument is always allowed _between_ tokens: such as between `[` and `]`. Notice some things that this grammar does _not_ say. Nowhere does it say that `:` is only allowed inside object `{...}`, nor that object keys have to be Strings. Panfix parsing is quite unusual in this regard. You don't define the exact grammar you want, you instead define a _superset_ of the grammar, and construct context-specific error messages later, as you process the parse tree. ### Parsing some (badly formed) JSON Let's see what happens when we parse this badly formed "JSON": ``` { "id": 999, "object_class:" "safe", "weight_kg": 54.5 "disposition": "friendly", "diet": [ "M&Ms", "Necco wafers", "other sweets", ], "interactions": { "target_id": 682, "effect": mixed, } } ``` as so: ``` let parser = make\_json\_parser().unwrap(); let input = ...read that JSON file...; let source = Source::new("bad\_json", input); match parser.parse(&source) { ... } ``` The call to `parse` will actually return `Ok` instead of `Err`! This is because panfix parsing is purposefully very lax. The only thing it cares about is that the operators are complete, and in this example `[` is always matched with `]` and `{` is always matched with `}`. Thus `parse` returns `Ok(tree)`. When you `Display` this tree, you get: ``` (Object (Comma (Keyval "id" 999) (Comma (_ "object_class:" "safe") (Comma (Keyval "weight_kg" (Keyval (_ 54.5 "disposition") "friendly")) (Comma (Keyval "diet" (Array (Comma "M&Ms" (Comma "Necco wafers" (Comma "other sweets" _))))) (Keyval "interactions" (Object (Comma (Keyval "target_id" 682) (Comma (Keyval "effect" mixed) _))))))))) ``` The mechanism for this laxness in parsing is that two kinds of nodes were implicitly inserted, displayed as `_` above. #### Blank nodes _Blank_ nodes are inserted where an argument is missing. Consider, for example, the "diet" portion of the JSON: "diet": [ "M&Ms", "Necco wafers", "other sweets", ], `,` is defined to take two arguments, but the argument _after_ `"other sweets",` is missing. Thus in the parse tree a Blank, written `_`, is inserted: (Array (Comma "M&Ms" (Comma "Necco wafers" (Comma "other sweets" _))))) In JSON this is an error, though in languages that allow a trailing comma it wouldn't be. #### Juxtapose nodes _Juxtapose_ nodes are the complement of Blank nodes: they are inserted where there are two expressions in a row with nothing to join them. For example, this fragment of the JSON: "weight_kg": 54.5 "disposition": "friendly", turns into this fragment of the parse tree: (Keyval "weight_kg" (Keyval (_ 54.5 "disposition") "friendly")) where the `_` is a Juxtapose node, saying that `54.5` and `"disposition"` were juxtaposed (i.e., adjacent). Just like the Blank case isn't always an error, the Juxtapose case might be expected. For example, even in JSON the empty list `[]` would parse as `(Array _)`. The combination of Blank and Juxtapose nodes makes panfix parsing very lax. This is beneficial in a way, as we'll see next. ### Parse tree -> JSON What now? The _hard_ work of producing a parse tree (with source locations, to boot) has already been done. What remains is the _easy_ but verbose work of walking that tree and converting it into a `Json` type: ```rust #[derive(Debug)] enum Json { String(String), Number(f64), Boolean(bool), Null, Array(Vec), Object(HashMap), } ``` You can find this conversion at [examples/json.rs](examples/json.rs). It's verbose but straightforward. Here's a snippet from it: ```rust fn parse_list(&mut self, mut visitor: Visitor<'s, '_, '_>, elems: &mut Vec) { while visitor.name() == "Comma" { let [head, tail] = visitor.children(); let head = self.parse_value(head); elems.push(head); visitor = tail; } if visitor.name() == "Blank" { self.error(visitor, "JSON does not allow trailing commas."); } else { elems.push(self.parse_value(visitor)); } } ``` This may feel like busywork, but it's not. Good error messages are extremely context dependent; `panfix` does not know enough to produce quality error messages on its own. Notice, for example, the message above: "JSON does not allow trailing commas". That can only be hand written. `visitor` is a reference to a node in the tree. You can call `.children()` to get its children, and `.error(message)` to construct an error message at that source location. The fact that panfix parsing is so lax comes out to shine: our JSON example produces a whole set of helpful error messages for the bad JSON we've been looking at: ``` Parse Error: Expected a key:value pair. At 'stdin' line 2. "object_class:" "safe", ^^^^^^^^^^^^^^^^^^^^^^ Parse Error: Expected a JSON value here, not a key:value pair. At 'stdin' lines 3-4. "weight_kg": 54.5 ^^^^ "disposition": "friendly", ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Parse Error: JSON does not allow trailing commas. At 'stdin' line 8. "other sweets", ^ Parse Error: Missing quotes. At 'stdin' line 12. "effect": mixed, ^^^^^ Parse Error: JSON does not allow trailing commas. At 'stdin' line 12. "effect": mixed, ^ ``` For the full example of JSON parsing, see [examples/json.rs](examples/json.rs). For another example, see [examples/calc.rs](examples/calc.rs). ## Specification Now that you have the lay of the land, let's switch to focus on the details, requirements, and guarantees. ### Panfix Grammars A panfix `Grammar` consists of a whitespace regex, and a set of _operators_. Everything is an operator, from numbers to identifiers to binary operators to function definitions. Each operator has the following properties: - _Name_ gives the operator a name, so that you can identify it in the parse tree later. - _Tokens_ say what sequence of tokens the operator matches. For example: identifiers might match the regex token `[a-zA-Z_]+`; parentheses for grouping would match the sequence of tokens "(", ")"; and function definitions might match "function", "(", ")", "{", "}". - _Precedence_ says how tightly it binds. For example, field access `_._` should bind tighter than array indexing `_[_]`, so that `catalog.entries[0]` parses as `(catalog.entries)[0]` rather than `catalog.(entries[0])`. - _Associativity_ says how it should group with itself. For example, minus should be _left associative_ so that `1 - 2 - 3` is equal to `((1 - 2) - 3) = -1 - 3 = -4` instead of _right associative_ where it would be equal to `(1 - (2 - 3)) = 1 - -1 = 0`. - _Fixity_ says whether it takes a left argument and whether it takes a right argument. For example: field access `_._` takes both a left and right argument; array indexing `_[_]` takes a left argument but no right argument (because there's an argument _before the first token `[`_ but no argument _after the last token `]`_); and a number like `2.5` takes neither a left nor a right argument. You _can_ specify these properties manually with `grammar.add_raw_op(name, prec, assoc, fixity, tokens)`, but there's also a higher level interface to make this easier: - `grammar.regex(name: &str, regex_pattern: &str)` defines an operator with a single token and no arguments. Since it has no arguments, the precedence and associativity are irrelevant. Use this for e.g. identifiers, numbers, and string literals. - `grammar.string(name: &str, string_pattern: &str)` is the same, but matches against a literal string pattern instead of a regex pattern. - `grammar.left_assoc()` and `grammar.right_assoc()` introduce a new _operator group_. Operators defined after them belong to that operator group. Operators in an operator group all have the same associativity, and all have the same precedence as each other. Operators defined in earlier groups have tighter precedence than operators defined in later groups. (If neither of these methods is called, the default is left-associativity.) - `grammar.op(name: &str, pattern: Pattern)` defines an operator. It inherits the precedence and associativity of its group, and `pattern` says what it's _fixity_ and _tokens_ are. In the `pattern!` macro, fixity is declared by putting a leading or trailing underscore if there is a left or right argument respectively, and putting the tokens in the middle in quotes. For example, `pattern!(_ "[" "]")` has two tokens, and its fixity is that is has a left argument but no right argument. A grammar must obey three simple rules: 1. Its regexs must be well formed. 2. At most two operators may start with the same token: one that has a left argument, and one that does not. (For example, this allows having an operator for both unary and binary minus: they both start with the token "-", but binary minus has a left argument and unary minus does not.) 3. Operators with the same precedence must have the same associativity. (This rule is enforced by the builder pattern, so you only need to think about it if you use `add_raw_op`.) All three rules are enforced by the `Grammar` type; you will get an error if you violate them when you call `Grammar.finish()`. ### Lexing When the source is being lexed, multiple regexes from the grammar might match. The rule for which one gets used is that: - The longer match wins, - or failing that a string pattern wins over a regex pattern, - or failing that the pattern defined earlier in the grammar wins. ### Parsing Given a panfix grammar and a source file, you can parse the source in _linear time_, producing either a parse error or a parse tree. There are three kinds of parse errors: - A lexing error. For example, encountering the character `%` in JSON. - An unexpected token. For example, encountering `)` without any `(`. - An operator began but was not completed. For example, encountering an `(` but no following `)`. These are the _only_ errors. If there aren't any nonsense tokens and your parentheses (and such) are matched, the parse will succeed! If parsing is successful, it produces a _parse tree_. Each node in the tree has an operator and a source location. The operators in the tree include the operators defined in the grammar, plus two extra operators: - The `"Blank"` operator denotes a missing argument. For example, in the source `true ||`, a "Blank" operator would be inserted to the right of `||`. - The `"Juxtapose"` operator denotes that two expressions are adjacent. For example, in the source `2 3`, a "Jutxapose" operator would be inserted between `2` and `3`. The parse tree is guaranteed to be well formed, in the sense that: 1. Each node has children in this order: (i) one child for its left argument, if it takes one; (ii) one child for each space "between" its tokens; and (iii) one child for its right argument, if any. Blank nodes have zero children and Juxtapose nodes have two. (For example, array indexing `_[_]` has two children: 1 for its left argument and 1 for the space between its first and second tokens.) 2. If a node has a left argument, then either it has looser precedence than its first child, or it is left associative and has equal precedence to its first child. Likewise, if it has a right argument then either it has looser precedence than its last child, or it is right associative and has equal precedence. 3. There aren't "too many" blanks and juxtaposes. For example, `1 + 2` will be parsed as `(+ 1 2)` and not as `(Juxtapose (+ 1 Blank) 2)`. ### Putting it all together 1. Construct a `Source`, from file or stdin or whatnot. 2. Construct a `Grammar` using the builder pattern, or `add_raw_op` if you need more control. Call `Grammar.finish()` to get a `Parser`. This will check if the grammar is valid and `Error` if not. 3. Parse using `Parser.parse(Source)`. If there are any errors, give up and display them. 4. If that succeeded, convert the `ParseTree` into an AST (or whatever your internal representation will be). _This is the time to check for errors_. Basically: walk the tree, checking the `name()` of each node: if it's expected then recur, and if it's unexpected then produce a custom error message. Again, to see a full example, see [examples/json.rs](examples/json.rs) or [examples/calc.rs](examples/calc.rs).