parlex-gen

version: 0.3.0
created: 2025-09-22
updated: 2025-10-23
description: Lexer generator ALEX and parser generator ASLR
repository: https://github.com/ikhomyakov/parlex.git
author: Igor Y. Khomyakov (ikhomyakov)

README

parlex-gen


Lexer generator ALEX and parser generator ASLR.

Overview

parlex-gen is the companion crate to parlex, providing the ALEX lexer generator and the ASLR parser generator. Together, these tools form the code generation component of the Parlex framework, enabling the automatic construction of efficient lexical analyzers and parsers in Rust.

The system is inspired by the classic lex (flex) and yacc (bison) utilities written for C, but provides a Rust-based implementation that is more composable and improves upon ambiguity resolution. Unlike lex and yacc, which mix custom user code with automatically generated code, Parlex cleanly separates the two: grammar rules and lexer definitions are explicitly named, and user code refers to them by name.

The ALEX lexer generator offers expressive power comparable to that of lex or flex. It leverages Rust’s standard regular expression libraries to construct deterministic finite automata (DFAs) that operate efficiently at runtime to recognize permitted lexical patterns. The system supports multiple lexical states, enabling context-sensitive tokenization.

The ASLR parser generator implements the SLR(1) parsing algorithm, which is somewhat less general than the LALR(1) method employed by yacc and bison. Nevertheless, ASLR introduces a significant enhancement: it supports dynamic runtime resolution of shift/reduce ambiguities, offering greater flexibility in domains such as Prolog, where operator definitions may be introduced or redefined at runtime.

Lexers and parsers generated by the parlex-gen tools depend on the parlex core library, which provides the traits, data structures, and runtime support necessary for their execution. Users define their grammars and lexical rules declaratively, invoke ALEX and ASLR to generate Rust source code, and integrate the resulting components with application logic through the abstractions provided by parlex.

Usage

Add this to your Cargo.toml:

[build-dependencies]
parlex-gen = "0.3"

You'll also need the core library:

[dependencies]
parlex = "0.3"

Lexer and Parser Generation with alex and aslr

Define your lexer in lexer.alex and your grammar in parser.g, then run the ALEX and ASLR generators to produce the corresponding Rust source files.

A typical build.rs script might look like this:

// In your build.rs
use std::path::PathBuf;
use parlex_gen::{alex, aslr};

fn main() {
    let manifest_dir = std::env::var("CARGO_MANIFEST_DIR").unwrap();
    let out_dir = PathBuf::from(std::env::var("OUT_DIR").unwrap());

    // --- ALEX Lexer Generation ---
    let input_file = PathBuf::from(&manifest_dir).join("src/lexer.alex");
    println!("cargo:rerun-if-changed={}", input_file.display());
    println!("cargo:warning=ALEX input file: {}", input_file.display());
    println!("cargo:warning=ALEX output directory: {}", out_dir.display());
    alex::generate(&input_file, &out_dir, "lexer_data", false).unwrap();

    // --- ASLR Parser Generation ---
    let input_file = PathBuf::from(&manifest_dir).join("src/parser.g");
    println!("cargo:rerun-if-changed={}", input_file.display());
    println!("cargo:warning=ASLR input file: {}", input_file.display());
    println!("cargo:warning=ASLR output directory: {}", out_dir.display());
    aslr::generate(&input_file, &out_dir, "parser_data", false).unwrap();
}
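The generated sources can then be pulled into the crate itself. Assuming the generator writes files named after the base names passed above (`lexer_data.rs` and `parser_data.rs` are an assumption here, not confirmed file names), a minimal sketch:

```rust
// In src/lib.rs -- pull in the code generated by build.rs.
// The file names are assumed to follow the base names passed to
// alex::generate and aslr::generate ("lexer_data", "parser_data").
include!(concat!(env!("OUT_DIR"), "/lexer_data.rs"));
include!(concat!(env!("OUT_DIR"), "/parser_data.rs"));
```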

Alex Lexer Specification Format

The Alex specification defines lexical rules for recognizing the textual structure of a language before parsing. It describes how to match the components of tokens — such as identifiers, numbers, delimiters, operators, and string or block contents — using regular expressions and lexical states.

Structure

An Alex specification contains:

  1. Macro definitions. Named regular expressions, declared as:

    NAME = regex
    

    Macros can be referenced with {{NAME}} inside other patterns. They are used to build complex rules from smaller reusable fragments (e.g., {{DEC}}, {{ATOM}}, {{VAR}}).

  2. Lexical rules. Each rule specifies the pattern to match and the lexical states in which it applies:

    RuleName: <State1, State2> pattern
    

    These rules describe low-level recognition of language elements — not yet semantic tokens, but the raw lexical building blocks.

  3. Lexical states. States define contexts that control which rules are active at any time. The lexer can switch states dynamically, allowing it to handle nested or context-dependent structures (for example, strings, comments, or embedded data blocks). A * in the state list indicates that the corresponding regular expression rule is active in all lexical states.

Example

WS = [ \t]
NL = \r?\n
IDENT = [a-z_][a-z_A-Z0-9]*
NUMBER = [0-9]+

Ident: <Expr> {{IDENT}}
Number: <Expr> {{NUMBER}}
Semicolon: <Expr> ;
Equals: <Expr> =
Plus: <Expr> \+
Minus: <Expr> -
Asterisk: <Expr> \*
Slash: <Expr> /
LeftParen: <Expr> \(
RightParen: <Expr> \)
CommentBegin: <Expr, Comment> /\*
CommentEnd: <Comment> \*/
CommentChar: <Comment> [^*\r\n]+
NewLine: <*> {{NL}}
WhiteSpace: <Expr> {{WS}}+
Error: <*> .

Note: The first lexical state encountered in the specification file is used as the starting lexer state (in this case, Expr).
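To illustrate what lexical states buy you at runtime, here is a hand-rolled sketch (plain Rust, not parlex-generated code) of a scanner that switches between an Expr state and a Comment state, mirroring the CommentBegin/CommentEnd rules above:

```rust
#[derive(PartialEq)]
enum State {
    Expr,
    Comment,
}

/// Strip `/* ... */` comments from the input, mimicking the Expr/Comment
/// state switch in the specification above. Hand-written for illustration.
fn strip_comments(input: &str) -> String {
    let mut state = State::Expr;
    let mut out = String::new();
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match state {
            State::Expr => {
                if c == '/' && chars.peek() == Some(&'*') {
                    chars.next(); // consume '*', enter the Comment state
                    state = State::Comment;
                } else {
                    out.push(c);
                }
            }
            State::Comment => {
                if c == '*' && chars.peek() == Some(&'/') {
                    chars.next(); // consume '/', return to the Expr state
                    state = State::Expr;
                }
                // everything else inside a comment is discarded
            }
        }
    }
    out
}

fn main() {
    assert_eq!(strip_comments("a = 1 /* note */ + 2"), "a = 1  + 2");
    println!("ok");
}
```

Because `CommentChar` and `CommentEnd` are only active in the Comment state, a `*` or `/` inside a comment cannot be mistaken for the `Asterisk` or `Slash` tokens of the Expr state.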

ASLR Grammar Specification Format

An ASLR specification defines a context-free grammar for use with the aslr SLR(1) parser generator. It consists of production rules, written in a simple, line-oriented format:

rule_name:   Nonterminal -> Symbol Symbol ...
  • Each line defines one production.
  • The left-hand side (LHS) is a nonterminal being defined.
  • The right-hand side (RHS) lists terminals and/or nonterminals.
  • An empty RHS represents an ε-production (the symbol can derive nothing).
  • Multiple alternative productions for the same nonterminal are written as separate rules.
  • Grammars can express nested and recursive definitions suitable for SLR(1) parsing.

Naming Rules

  • Rule names follow the pattern: [a-z]([a-zA-Z0-9])*

  • Nonterminals use capitalized names (e.g., Expr, Term, Seq).

  • Terminals follow either:

    • [a-z]([a-zA-Z0-9])* — for word-like tokens, or
    • one of the special symbols below, which are translated to lowercase names.
    .  dot           -  minus         ~  tilde         `  backtick
    !  exclamation   @  at            #  hash          $  dollar
    %  percent       ^  caret         &  ampersand     *  asterisk
    +  plus          =  equals        |  pipe          \  backslash
    <  lessThan      >  greaterThan   ?  question      /  slash
    ;  semicolon     (  leftParen     )  rightParen    [  leftBrack
    ]  rightBrack    {  leftBrace     }  rightBrace    ,  comma
    '  singleQuote   "  doubleQuote   :  colon

Example

stat1: Stat ->
stat2: Stat -> Expr
stat3: Stat -> ident = Expr
expr1: Expr -> number
expr2: Expr -> ident
expr3: Expr -> Expr + Expr
expr4: Expr -> Expr - Expr
expr5: Expr -> Expr * Expr
expr6: Expr -> Expr / Expr
expr7: Expr -> - Expr
expr8: Expr -> ( Expr )
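Note that this grammar is ambiguous: rules expr3 through expr7 give the operators no fixed precedence or associativity, so the SLR(1) tables contain shift/reduce conflicts. This is the situation ASLR's dynamic conflict resolution targets. As a conceptual sketch (not the parlex API), such a conflict between the operator on the stack and the incoming operator can be decided by comparing precedences looked up at parse time:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Action {
    Shift,
    Reduce,
}

/// Decide a shift/reduce conflict from operator precedences looked up at
/// runtime (so they may be introduced or redefined, as in Prolog).
/// Higher number = binds tighter; ties reduce (left associativity).
fn resolve(prec: &HashMap<char, u8>, on_stack: char, incoming: char) -> Action {
    if prec[&incoming] > prec[&on_stack] {
        Action::Shift
    } else {
        Action::Reduce
    }
}

fn main() {
    let mut prec = HashMap::from([('+', 1), ('-', 1), ('*', 2), ('/', 2)]);
    // In `a + b * c`, seeing `*` with `+` on the stack: shift.
    assert_eq!(resolve(&prec, '+', '*'), Action::Shift);
    // In `a * b + c`, seeing `+` with `*` on the stack: reduce first.
    assert_eq!(resolve(&prec, '*', '+'), Action::Reduce);
    // Redefining a precedence at runtime flips the decision.
    prec.insert('+', 3);
    assert_eq!(resolve(&prec, '*', '+'), Action::Shift);
    println!("ok");
}
```

A static generator must bake one such decision into the tables; deferring it to runtime is what lets the same tables serve a language whose operator table changes while parsing.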

License

Copyright (c) 2005–2025 IKH Software, Inc.

Released under the terms of the GNU Lesser General Public License, version 3.0 or (at your option) any later version (LGPL-3.0-or-later).

See Also

  • parlex: the core runtime library used by lexers and parsers generated by this crate.
