Crates.io | pag-lexer |
lib.rs | pag-lexer |
version | 0.1.0-alpha.1 |
source | src |
created_at | 2023-06-01 02:18:53.058315 |
updated_at | 2023-06-01 02:18:53.058315 |
description | Parser-lexer fusion generator (derivative lexer) |
homepage | |
repository | https://github.com/SchrodingerZhu/paguroidea |
max_upload_size | |
id | 879397 |
size | 3,964,828 |
Crate | Status |
---|---|
pag-lexer |
|
pag-parser |
|
pag-compiler |
A reimplementation of the Flap parser in Rust (with our own modifications applied)!
This project is still under its early-stage development. The grammar for Paguroidea is subject to change (see Issue #22). The parser generation is not thoroughly tested, which may still shake some bugs out from time to time. There are also ongoing works to improvement the quality of the generated code.
Paguroidea is a parser generator (a.k.a. the compiler of compiler). The theoretical foundation of Paguroidea is built on a few papers:
We modified the work of flap by extending DGNF with tree-generation actions, which are similar to the "reduce" operation in a traditional shift-reduce parser.
Notice: grammar for parser definitions used in this section will be changed in the near future.
It is simple: just define your grammar and pass it to our compiler. Then, Paguroidea will output a standalone parser file.
For example, a simple S-expression parser can be defined as the following
lexer {
definition BLANK = ' ';
definition DIGIT = '0' .. '9';
definition ALPHA = 'a' .. 'z' | 'A' .. 'Z';
active token LPAREN = '(';
active token RPAREN = ')';
active token ATOM = ALPHA ~ (ALPHA | DIGIT)*;
silent token WHITESPACE = (BLANK | '\t' | '\n' | '\r')+;
}
parser sexpr {
active fixpoint compound
= LPAREN ~ (compound | atom) * ~ RPAREN;
active definition atom
= ATOM;
active definition sexpr
= compound | atom;
}
You can put up your own one with the following rules:
A grammar file must contain both lexer and parser parts.
A definition
in lexer part is a macro
representing some common lexical rules. A definition
itself does not count as a token, which is similar to fragment
in ANTLR.
A lexer can atmost have one silent
token. silent
tokens will be automatically skipped during
parsing.
All rules defined in lexer part must be full uppercase.
You can use
'_'
)'a', '\x12', '๐'
)"ไฝ ๅฅฝ", "Rust"
)'A' .. 'Z'
)'a' ~ 'b'
),'a' | 'b'
)'a'?
)'a'\*
)'a'+
)!'a'
)to make up your regular expressions. Notice that complement is not negative lookahead in the common sense. Rather, it represents characters or languages complement to negated one. It is required that all active tokens cannot be nullable.
The parser part must have an entrypoint specified in the header.
Strings/characters/ranges cannot be directly used in the parser part, but parser can refer to tokens defined in lexer.
Parser rules are all in lowercase.
Most combinators in the lexer part are also supported in the parser part except for complement.
For more complicated examples, one can see json.pag.
To compile your grammar file, the recommended way is to add pag-compiler
as your build dependency. With pag-compiler
,
the parser file can be easily generated in a build script as the following:
fn main() {
pag_compiler::compile("csv.pag", "src/parser.rs");
println!("cargo:rerun-if-changed=csv.pag");
}
For some reasons (mostly performance issues), only nightly rust (1.71+) is supported for now. It is also required that the crate containing the parser file should be annotated with
#![feature(portable_simd)]
#![feature(core_intrinsics)]
#![feature(array_chunks)]
We are continuously working on improvement the quality of our generated parser. For now, on workloads of CSV/JSON, the performance is close to or even better than those specialized parsers.
=== Random Generated CSV ===
throughput/pag time: [635.88 ยตs 637.64 ยตs 639.46 ยตs]
thrpt: [622.63 MiB/s 624.41 MiB/s 626.14 MiB/s]
throughput/csv time: [528.36 ยตs 541.72 ยตs 559.54 ยตs]
thrpt: [711.56 MiB/s 734.97 MiB/s 753.55 MiB/s]
throughput/pest time: [3.7278 ms 3.7364 ms 3.7460 ms]
thrpt: [106.29 MiB/s 106.56 MiB/s 106.80 MiB/s]
=== Random Generated JSON ===
random-json/pag-json time: [22.634 ns 22.650 ns 22.666 ns]
thrpt: [84.149 MiB/s 84.209 MiB/s 84.271 MiB/s]
random-json/serde-json time: [12.493 ns 12.587 ns 12.694 ns]
thrpt: [150.26 MiB/s 151.54 MiB/s 152.67 MiB/s]
random-json/pest-json time: [177.38 ns 178.17 ns 179.17 ns]
thrpt: [10.645 MiB/s 10.705 MiB/s 10.753 MiB/s]
=== twitter.json ===
twitter-json/pag-json time: [1.0923 ms 1.0941 ms 1.0961 ms]
thrpt: [667.24 MiB/s 668.46 MiB/s 669.59 MiB/s]
twitter-json/serde-json time: [1.2281 ms 1.2295 ms 1.2312 ms]
thrpt: [594.02 MiB/s 594.88 MiB/s 595.54 MiB/s]
twitter-json/pest-json time: [5.2977 ms 5.3055 ms 5.3148 ms]
thrpt: [137.61 MiB/s 137.85 MiB/s 138.06 MiB/s]
*
, +
or
mark them rule as silent rules if possible.(BLANK | '\t' | '\n' | '\r')+
).We provide diagnostic information for "type errors" in your grammar definitions. Here are some examples:
Left-recursion
Error: Unguarded fixpoint
โญโ[json.pag:39:5]
โ
39 โ active fixpoint json = json ~ value;
โ โโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโ
โ โฐโโโโโโโโโโโโโโโโโโโ fixpoint rule json is not guarded -- your grammar is left-recursive
โโโโโฏ
Sequence Ambiguity
Explanation: there may be ambiguity when separating a sequence into two part according to the grammar definition
Error: When type checking a sequence of rules, the following rules are ambiguous
โญโ[json.pag:39:28]
โ
39 โ active fixpoint test = NUMBER+ ~ NUMBER+;
โ โโโโฌโโโ โโโโฌโโโ
โ โฐโโโโโโโโโโโโโโโ type info for left-hand side: nullable: false, first set: {NUMBER}, follow set: {NUMBER}
โ โ
โ โฐโโโโโ type info for right-hand side: nullable: false, first set: {NUMBER}, follow set: {NUMBER}
โโโโโฏ
Alternation Ambiguity
Explanation: there may be ambiguity when select a match in an alternation of two rules.
Error: When type checking an alternation of rules, the following rules are ambiguous
โญโ[json.pag:39:28]
โ
39 โ active fixpoint test = NUMBER+ | NUMBER;
โ โโโโฌโโโ โโโโฌโโ
โ โฐโโโโโโโโโโโโโโ type info for left-hand side: nullable false, first set: NUMBER, follow set: NUMBER
โ โ
โ โฐโโโโ type info for right-hand side: nullable false, first set: NUMBER, follow set:
โโโโโฏ
There are other diagnostic information for undefined references, nullable tokens in lexer, character format error, etc.