Crates.io | panfix |
lib.rs | panfix |
version | 0.4.0 |
source | src |
created_at | 2023-07-27 21:53:37.8401 |
updated_at | 2023-07-27 21:53:37.8401 |
description | Panfix parsing: linear time parsing of multifix operators. |
homepage | |
repository | https://github.com/justinpombrio/panfix |
max_upload_size | |
id | 927921 |
size | 370,844 |
Panfix parsing is a new approach to parsing based on (but more expressive than) operator precedence grammars.
It is not based on Context Free Grammars (CFGs), nor on Parsing Expression Grammars (PEGs). It's has a different style of grammar, roughly a list of multifix operators. The rules for what constitute a valid grammar are extremely simple (unlike, say, LR or LALR parsing).
Panfix parsing always runs in linear time, with no dependence on the size of
the grammar. That is, if N
is the size of the text to be parsed and G
is the
size of the grammar, then a panfix parser runs in O(N)
(not O(NG)
like
Packrat PEG parsing).
It gives very lax parsers, which will happily generate a parse tree in the presense of a variety of errors. On one hand, this increases the number of cases you need to deal with when processing a parse tree. On the other hand, these extra cases are error cases and this is a good time to produce custom error messages for them.
The best introduction might be a worked example. Let's look at what it takes to parse JSON with panfix parsing. Here's a panfix grammar for JSON:
use panfix::{Parser, Grammar, GrammarError};
fn make_json_parser() -> Result<Parser, GrammarError> {
let mut grammar = Grammar::new("[ \n\r\t]+")?;
grammar.regex("String", r#""([^\\"]|(\\.))*""#)?;
grammar.regex("Number", r#"-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?"#)?;
grammar.regex("Invalid", r#"[a-zA-Z_][a-zA-Z0-9_]*"#)?; // for catching missing qutoes
grammar.string("Null", "null")?;
grammar.string("True", "true")?;
grammar.string("False", "false")?;
grammar.op("Array", pattern!("[" "]"))?;
grammar.op("Object", pattern!("{" "}"))?;
grammar.right_assoc();
grammar.op("Keyval", pattern!(_ ":" _))?;
grammar.right_assoc();
grammar.op("Comma", pattern!(_ "," _))?;
grammar.finish()
}
The regex
and string
lines are defining how to parse constants like strings
and booleans. regex
means "match this regex patternand
stringmeans "match this exact string". The "Array" line says that
[starts an array and
]` ends
it. The "Object" line is similar.
Next up is:
grammar.right_assoc();
grammar.op("Keyval", pattern!(_ ":" _))?;
grammar.right_assoc();
grammar.op("Comma", pattern!(_ "," _))?;
The two right\_assoc()
calls begin two precedence groups, each with one
operator. Precedence groups have two effects.
First, operators in groups defined earlier bind tighter than operators defined
in later groups. For example, you would define &&
in an earlier precedence
group than >
, so that x < y && y < z
parses as (x < y) && (y < z)
instead
of x < (y && y) < z
.
Second, each precedence group declares whether its operators are left or right
associative. For example, field access _._
should be declared right
associative so that person.birthday.month
is parsed as
(person.birthday).month
instead of person.(birthday.month)
. In the JSON
example, we declare :
and ,
to be right associative, but in this case left
associative would work just as well.
The pattern!(_ "," _)
declares that ,
is an operator that takes a left and
right argument, denoted by an underscore at the beginning and end. In contrast,
pattern!("[" "]")
lacks underscores and thus says that JSON arrays do not take
a left or right argument. On the other hand, an argument is always allowed
between tokens: such as between [
and ]
.
Notice some things that this grammar does not say. Nowhere does it say that
:
is only allowed inside object {...}
, nor that object keys have to be
Strings. Panfix parsing is quite unusual in this regard. You don't define the
exact grammar you want, you instead define a superset of the grammar, and
construct context-specific error messages later, as you process the parse tree.
Let's see what happens when we parse this badly formed "JSON":
{
"id": 999,
"object_class:" "safe",
"weight_kg": 54.5
"disposition": "friendly",
"diet": [
"M&Ms",
"Necco wafers",
"other sweets",
],
"interactions": {
"target_id": 682,
"effect": mixed,
}
}
as so:
let parser = make\_json\_parser().unwrap();
let input = ...read that JSON file...;
let source = Source::new("bad\_json", input);
match parser.parse(&source) { ... }
The call to parse
will actually return Ok
instead of Err
! This is because
panfix parsing is purposefully very lax. The only thing it cares about is that
the operators are complete, and in this example [
is always matched with ]
and
{
is always matched with }
. Thus parse
returns Ok(tree)
. When you Display
this tree, you get:
(Object
(Comma (Keyval "id" 999)
(Comma (_ "object_class:" "safe")
(Comma
(Keyval "weight_kg"
(Keyval (_ 54.5 "disposition") "friendly"))
(Comma
(Keyval "diet"
(Array
(Comma "M&Ms"
(Comma "Necco wafers"
(Comma "other sweets" _)))))
(Keyval "interactions"
(Object
(Comma (Keyval "target_id" 682)
(Comma (Keyval "effect" mixed) _)))))))))
The mechanism for this laxness in parsing is that two kinds of nodes were
implicitly inserted, displayed as _
above.
Blank nodes are inserted where an argument is missing. Consider, for example, the "diet" portion of the JSON:
"diet": [
"M&Ms",
"Necco wafers",
"other sweets",
],
,
is defined to take two arguments, but the argument after "other sweets",
is missing. Thus in the parse tree a Blank, written _
, is inserted:
(Array
(Comma "M&Ms"
(Comma "Necco wafers"
(Comma "other sweets" _)))))
In JSON this is an error, though in languages that allow a trailing comma it wouldn't be.
Juxtapose nodes are the complement of Blank nodes: they are inserted where there are two expressions in a row with nothing to join them. For example, this fragment of the JSON:
"weight_kg": 54.5
"disposition": "friendly",
turns into this fragment of the parse tree:
(Keyval "weight_kg"
(Keyval (_ 54.5 "disposition") "friendly"))
where the _
is a Juxtapose node, saying that 54.5
and "disposition"
were
juxtaposed (i.e., adjacent).
Just like the Blank case isn't always an error, the Juxtapose case might be
expected. For example, even in JSON the empty list []
would parse as
(Array _)
.
The combination of Blank and Juxtapose nodes makes panfix parsing very lax. This is beneficial in a way, as we'll see next.
What now? The hard work of producing a parse tree (with source locations, to
boot) has already been done. What remains is the easy but verbose work of
walking that tree and converting it into a Json
type:
#[derive(Debug)]
enum Json {
String(String),
Number(f64),
Boolean(bool),
Null,
Array(Vec<Json>),
Object(HashMap<String, Json>),
}
You can find this conversion at examples/json.rs. It's verbose but straightforward. Here's a snippet from it:
fn parse_list(&mut self, mut visitor: Visitor<'s, '_, '_>, elems: &mut Vec<Json>) {
while visitor.name() == "Comma" {
let [head, tail] = visitor.children();
let head = self.parse_value(head);
elems.push(head);
visitor = tail;
}
if visitor.name() == "Blank" {
self.error(visitor, "JSON does not allow trailing commas.");
} else {
elems.push(self.parse_value(visitor));
}
}
This may feel like busywork, but it's not. Good error messages are extremely
context dependent; panfix
does not know enough to produce quality error
messages on its own. Notice, for example, the message above: "JSON does not
allow trailing commas". That can only be hand written.
visitor
is a reference to a node in the tree. You can call .children()
to
get its children, and .error(message)
to construct an error message at that
source location.
The fact that panfix parsing is so lax comes out to shine: our JSON example produces a whole set of helpful error messages for the bad JSON we've been looking at:
Parse Error: Expected a key:value pair.
At 'stdin' line 2.
"object_class:" "safe",
^^^^^^^^^^^^^^^^^^^^^^
Parse Error: Expected a JSON value here, not a key:value pair.
At 'stdin' lines 3-4.
"weight_kg": 54.5
^^^^
"disposition": "friendly",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Parse Error: JSON does not allow trailing commas.
At 'stdin' line 8.
"other sweets",
^
Parse Error: Missing quotes.
At 'stdin' line 12.
"effect": mixed,
^^^^^
Parse Error: JSON does not allow trailing commas.
At 'stdin' line 12.
"effect": mixed,
^
For the full example of JSON parsing, see examples/json.rs. For another example, see examples/calc.rs.
Now that you have the lay of the land, let's switch to focus on the details, requirements, and guarantees.
A panfix Grammar
consists of a whitespace regex, and a set of operators.
Everything is an operator, from numbers to identifiers to binary operators to
function definitions. Each operator has the following properties:
[a-zA-Z_]+
; parentheses for grouping
would match the sequence of tokens "(", ")"; and function definitions might
match "function", "(", ")", "{", "}"._._
should
bind tighter than array indexing _[_]
, so that catalog.entries[0]
parses
as (catalog.entries)[0]
rather than catalog.(entries[0])
.1 - 2 - 3
is equal to ((1 - 2) - 3) = -1 - 3 = -4
instead of right associative where it would be equal to (1 - (2 - 3)) = 1 - -1 = 0
._._
takes both a left and right
argument; array indexing _[_]
takes a left argument but no right argument
(because there's an argument before the first token [
but no argument
after the last token ]
); and a number like 2.5
takes neither a left nor
a right argument.You can specify these properties manually with grammar.add_raw_op(name, prec, assoc, fixity, tokens)
, but there's also a higher level interface to make this
easier:
grammar.regex(name: &str, regex_pattern: &str)
defines an operator with a
single token and no arguments. Since it has no arguments, the precedence and
associativity are irrelevant. Use this for e.g. identifiers, numbers, and
string literals.grammar.string(name: &str, string_pattern: &str)
is the same, but matches
against a literal string pattern instead of a regex pattern.grammar.left_assoc()
and grammar.right_assoc()
introduce a new operator
group. Operators defined after them belong to that operator group. Operators
in an operator group all have the same associativity, and all have the same
precedence as each other. Operators defined in earlier groups have tighter
precedence than operators defined in later groups. (If neither of these
methods is called, the default is left-associativity.)grammar.op(name: &str, pattern: Pattern)
defines an operator. It inherits
the precedence and associativity of its group, and pattern
says what it's
fixity and tokens are. In the pattern!
macro, fixity is declared by
putting a leading or trailing underscore if there is a left or right argument
respectively, and putting the tokens in the middle in quotes. For example,
pattern!(_ "[" "]")
has two tokens, and its fixity is that is has a left
argument but no right argument.A grammar must obey three simple rules:
add_raw_op
.)All three rules are enforced by the Grammar
type; you will get an error if
you violate them when you call Grammar.finish()
.
When the source is being lexed, multiple regexes from the grammar might match. The rule for which one gets used is that:
Given a panfix grammar and a source file, you can parse the source in linear time, producing either a parse error or a parse tree.
There are three kinds of parse errors:
%
in JSON.)
without any (
.(
but
no following )
.These are the only errors. If there aren't any nonsense tokens and your parentheses (and such) are matched, the parse will succeed!
If parsing is successful, it produces a parse tree. Each node in the tree has an operator and a source location. The operators in the tree include the operators defined in the grammar, plus two extra operators:
"Blank"
operator denotes a missing argument. For example, in the source
true ||
, a "Blank" operator would be inserted to the right of ||
."Juxtapose"
operator denotes that two expressions are adjacent. For
example, in the source 2 3
, a "Jutxapose" operator would be inserted
between 2
and 3
.The parse tree is guaranteed to be well formed, in the sense that:
_[_]
has two
children: 1 for its left argument and 1 for the space between its first and
second tokens.)1 + 2
will
be parsed as (+ 1 2)
and not as (Juxtapose (+ 1 Blank) 2)
.Source
, from file or stdin or whatnot.Grammar
using the builder pattern, or add_raw_op
if you need
more control. Call Grammar.finish()
to get a Parser
. This will check if
the grammar is valid and Error
if not.Parser.parse(Source)
. If there are any errors, give up and
display them.ParseTree
into an AST (or whatever your
internal representation will be). This is the time to check for errors.
Basically: walk the tree, checking the name()
of each node: if it's
expected then recur, and if it's unexpected then produce a custom error
message.Again, to see a full example, see examples/json.rs or examples/calc.rs.