| Crates.io | peggen |
| lib.rs | peggen |
| version | 0.3.2 |
| created_at | 2024-08-16 00:03:45.038809+00 |
| updated_at | 2025-04-14 03:31:43.350889+00 |
| description | Generate recursive-descent & precedence climbing parsers. |
| homepage | |
| repository | https://github.com/Y-jiji/peggen |
| max_upload_size | |
| id | 1339771 |
| size | 17,358 |
A parser generator for parsing expression grammar (PEG) that use inline macros to specify PEG operations.
This is how it looks like:
use peggen::*;
#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
pub enum Json {
#[rule(r"null")]
Null,
#[rule(r"{0:`false|true`}")]
Bool(bool),
#[rule(r"{0:`-?(0|[1-9][0-9]*)\.([0-9]+)`}")]
Flt(f32),
#[rule("{0:`0|-?[1-9][0-9]*`}")]
Num(i32),
#[rule(r#""{0:`[^"]*`}""#)]
Str(String),
}
fn main() {
let json = Parser::<Json>::parse("{x: 1, y: 2}").unwrap();
println!("{json:?}");
}
Help needed! There is so much to do! Contact Me
FromStr, arean allocation, and left recursion handling.| / | Conceptual | User Experience | Performance & Code Quality | Error Handling |
|---|---|---|---|---|
| PEST | PEST only annotates text. Peggen generates AST directly from your text. |
In most cases, you still want rust enums for your AST, which is directly provided by Peggen, but you have to manually create enums from PEST rules. |
PEST use an optimizer to memorize your grammar rules, and use memorization for better performance; Peggen doesn't use memorization, arguably this gives better performance over memorization for most grammars. | / |
| Chumsky | Chumsky provides parser combinators. Peggen is a parser generator. | Both Chumsky and Peggen provides ast directly. However, Peggen supports arena allocation. | Chumsky deallocates successful sub-rules when a rule fails; Peggen uses a internal representation to eliminate deallocation. Besides, Peggen handles left recursion, while in Chumsky left recursion causes in stack overflow. | / |
| LALRPOP | Peggen is PEG-based; LALRPOP uses LR(1) grammar. | Peggen is more intuitive to use than LALRPOP; LR(1) grammar is hard to extend and debug. | LALRPOP has better performance over Peggen. | LR(1) grammar can report errors far away from normally percepted cause; Peggen allows you to capture errors from customary cause. |
I roughly tested the peggen on a sample json file against chumsky.
CPU Model: Intel(R) Core(TM) i7-14700HX
Suprisingly, Peggen is faster than Chumsky.
Here are some numbers:
Before we start this tutorial, let's look at how it looks like after your first try.
#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
pub enum Json {
#[rule(r"null")]
Null,
#[rule(r"{0:`false|true`}")]
Bool(bool),
#[rule(r"{0:`-?(0|[1-9][0-9]*)\.([0-9]+)`}")]
Flt(f32),
#[rule("{0:`0|-?[1-9][0-9]*`}")]
Num(i32),
#[rule(r#""{0:`[^"]*`}""#)]
Str(String),
#[rule(r#"\{ [*0: "{0:`[^"]*`}" : {1} , ][?0: "{0:`[^"]*`}" : {1} ] \}"#)]
Obj(RVec<(String, Json)>),
#[rule(r"\[ [*0: {0} , ][?0: {0} ] \]")]
Arr(RVec<Json>)
}
We have a Parser type that help you use annotated enum.
use peggen::*;
fn main() {
let json = Parser::<Json>::parse("{x: 1, y: 2}").unwrap();
println!("{json:?}");
}
A rule attribute looks like the following:
#[rule("...")]
The string ensembles rust formatting string that you use in println!(). Rust formatting string represents a sequence of chars/structures printed one after another. Peggen formatting string represents a sequence of chars/parsers that eat the input string one after another.
For example, the following statement prints:
false as the first argumentand19 as the second argumentprintln!("{0} and {1}", false, 19);
You can write a parser that parses <bool> and <int> using a rule attribute. However, Peggen needs to know what kind of string can be recognized as bool and what kind of string can be recognized as i64. For this purpose, we can write regular expressions.
#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
#[rule("{0:`false|true`} and {1:`[0-9]+`}")]
struct BoolAndInt(bool, i64);
What will the following statement print?
println!("{:?}", Parser::<BoolAndInt>::parse("false and 19").unwrap());
BoolAndInt(false, 19);
An enum is a collection of rules, during parsing, the rules declared in an enum is tried one by one until one of them matches.
#[derive(Debug, ParseImpl, Space, Num, EnumAstImpl)]
pub enum Json {
#[rule(r"null")]
Null,
#[rule(r"{0:`false|true`}")]
Bool(bool),
#[rule(r"{0:`-?(0|[1-9][0-9]*)\.([0-9]+)`}")]
Flt(f32),
#[rule("{0:`0|-?[1-9][0-9]*`}")]
Num(i32),
#[rule(r#""{0:`[^"]*`}""#)]
Str(String),
}
" escaped to \"?"\"a string\"".#[rule(r#""{0:`([^"]|\\")*`}""#)]
Str(String)
Given that you can have multiple rules on the same enum variant, what is the alternative way of writing the Bool(bool) operation?
#[rule(r#""{0:`false`}"#)]
#[rule(r#""{0:`true`}"#)]
Bool(bool)
TODO