plexer

Crates.io: plexer
lib.rs: plexer
version: 0.1.2
source: src
created_at: 2024-01-17 21:06:09.756878
updated_at: 2024-01-19 23:10:22.214518
description: A Pattern-matching LEXER
homepage:
repository: https://github.com/emsquid/plexer/
max_upload_size:
id: 1103402
size: 20,620
owner: emanuel (emsquid)

documentation

README

Pattern Lexer

My personal implementation of a lexer.

Principle

This lexer makes use of the Pattern trait to find tokens.
The idea is to create Tokens, describe how to match them with a Pattern, and build them from the matched String value.

Pattern

A string Pattern trait.

A type implementing it can be used as a pattern for &str. By default, it is implemented for the following types.

Pattern type           Match condition
char                   is contained in the string
&str                   is a substring
String                 is a substring
&[char]                any char matches
&[&str]                any &str matches
F: Fn(&str) -> bool    F returns true for the substring (slow)
Regex                  the Regex matches the substring
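
The table above can be sketched as a minimal trait (a hypothetical, stdlib-only simplification for illustration, not the crate's actual definition): a pattern simply decides whether it matches a candidate substring. The method name is_match and the fn-pointer impl (standing in for the F: Fn(&str) -> bool case) are assumptions of this sketch.

```rust
// Hypothetical sketch of the Pattern idea (not the crate's actual trait):
// a pattern decides whether it matches a candidate substring.
trait Pattern {
    fn is_match(&self, s: &str) -> bool;
}

impl Pattern for char {
    // A char matches when it is contained in the string.
    fn is_match(&self, s: &str) -> bool {
        s.contains(*self)
    }
}

impl Pattern for &str {
    // A &str matches when it is a substring.
    fn is_match(&self, s: &str) -> bool {
        s.contains(self)
    }
}

impl Pattern for &[char] {
    // A char slice matches when any of its chars match.
    fn is_match(&self, s: &str) -> bool {
        self.iter().any(|c| c.is_match(s))
    }
}

impl Pattern for fn(&str) -> bool {
    // A function pointer matches when it returns true for the substring.
    fn is_match(&self, s: &str) -> bool {
        self(s)
    }
}

fn main() {
    assert!('+'.is_match("1+2"));
    assert!("let".is_match("let x = 1"));
    assert!((&[' ', '\n'][..]).is_match("a b"));
    let three_chars: fn(&str) -> bool = |s| s.chars().count() == 3;
    assert!(three_chars.is_match("abc"));
    println!("all patterns matched");
}
```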

Usage

The lexer! macro expects the following syntax.

lexer!(
    // Ordered by priority
    NAME(optional types, ...) {
        impl Pattern => |value: String| -> Token,
        ...,
    },
    ...,
);

It generates a module gen which contains Token, LexerError, LexerResult and Lexer.

You can then call Token::tokenize on a &str; it returns a Lexer instance that implements Iterator.
On each iteration, the Lexer tries to match each of the given Patterns and returns a LexerResult<Token> built from the best match.
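
The matching loop described above can be sketched roughly as follows (a hypothetical, stdlib-only simplification, not the crate's internals): patterns are tried in priority order on growing prefixes of the input, the longest accepted prefix wins, and a builder turns it into a token. The Token variants and the whitespace-skipping shortcut here are assumptions of this sketch.

```rust
// Hypothetical sketch of a best-match lexing loop (not the crate's internals).
#[derive(Debug, PartialEq)]
enum Token {
    Number(usize),
    Op(char),
}

fn tokenize(input: &str) -> Vec<Token> {
    // (matcher, builder) pairs, ordered by priority.
    let patterns: Vec<(fn(&str) -> bool, fn(&str) -> Token)> = vec![
        (
            |s| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()),
            |s| Token::Number(s.parse().unwrap()),
        ),
        (
            |s| s.len() == 1 && "+-*/=".contains(s),
            |s| Token::Op(s.chars().next().unwrap()),
        ),
    ];

    let mut tokens = Vec::new();
    // For brevity this sketch skips whitespace instead of emitting a token.
    let mut rest = input.trim_start();
    while !rest.is_empty() {
        // Find the longest prefix accepted by any pattern; on ties,
        // the earlier (higher-priority) pattern wins.
        let mut best: Option<(usize, usize)> = None; // (pattern index, length)
        for (i, (is_match, _)) in patterns.iter().enumerate() {
            for len in (1..=rest.len()).rev() {
                if rest.is_char_boundary(len) && is_match(&rest[..len]) {
                    if best.map_or(true, |(_, l)| len > l) {
                        best = Some((i, len));
                    }
                    break;
                }
            }
        }
        let (i, len) = best.expect("no pattern matched");
        tokens.push((patterns[i].1)(&rest[..len]));
        rest = rest[len..].trim_start();
    }
    tokens
}

fn main() {
    assert_eq!(
        tokenize("12 + 3"),
        vec![Token::Number(12), Token::Op('+'), Token::Number(3)]
    );
    println!("{:?}", tokenize("12 + 3"));
}
```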

Example

Here is an example of a simple math lexer.

lexer!(
    // Different operators
    OPERATOR(char) {
        '+' => |_| Token::OPERATOR('+'),
        '-' => |_| Token::OPERATOR('-'),
        '*' => |_| Token::OPERATOR('*'),
        '/' => |_| Token::OPERATOR('/'),
        '=' => |_| Token::OPERATOR('='),
    },
    // Integer numbers
    NUMBER(usize) {
        |s: &str| s.chars().all(|c| c.is_digit(10))
            => |v: String| Token::NUMBER(v.parse().unwrap()),
    },
    // Variable names
    IDENTIFIER(String) {
        regex!(r"[a-zA-Z_$][a-zA-Z_$0-9]*")
            => |v: String| Token::IDENTIFIER(v),
    },
    WHITESPACE {
        [' ', '\n'] => |_| Token::WHITESPACE,
    },
);

This expands to the following enum and structs.

mod gen {
    pub enum Token {
        OPERATOR(char),
        NUMBER(usize),
        IDENTIFIER(String),
        WHITESPACE,
    }

    pub struct Lexer {...}
    pub struct LexerError {...}
    pub type LexerResult<T> = Result<T, LexerError>;
}

You can then use them:

use gen::*;

let mut lex = Token::tokenize("x_4 = 1 + 3 = 2 * 2");
assert_eq!(lex.nth(2), Some(Ok(Token::OPERATOR('='))));
assert_eq!(lex.nth(5), Some(Ok(Token::NUMBER(3))));

// Our lexer doesn't handle parentheses...
let mut err = Token::tokenize("x_4 = (1 + 3)");
assert!(err.nth(4).is_some_and(|res| res.is_err()));