monster-regex

Crates.io	monster-regex
lib.rs	monster-regex
version	0.2.2
created_at	2025-12-31 06:17:15.793536+00
updated_at	2026-01-09 04:19:58.664197+00
description	A custom regex spec
repository	https://github.com/monster0506/monster-regex
id	2013988
size	1,165,053

TJ Raklovits (Monster0506)

documentation

https://docs.rs/monster-regex

README

Rift Search Specification

This document outlines the regular expression syntax and features supported by Rift's search engine.

Usage

Add monster-regex to your Cargo.toml:

[dependencies]
monster-regex = "0.2.2"

Basic Example (Backtracking Engine)

By default, Regex::new uses the BacktrackingRegexEngine. This engine supports advanced features like lookarounds and backreferences but may have exponential runtime on pathological patterns.

use monster_regex::{Regex, Flags};

fn main() {
    // Compile using the default backtracking engine
    let re = Regex::new(r"\w+", Flags::default()).unwrap();

    assert!(re.is_match("hello"));
    
    // Find a match
    if let Some(m) = re.find("hello world") {
        println!("Found match at {}-{}", m.start, m.end); // 0-5
    }
}

Linear Engine (O(n))

For performance-critical code where O(n) guarantees are required, use the LinearRegexEngine (based on PikeVM). Note that this engine does not support lookarounds or backreferences.

use monster_regex::{Regex, Flags};

fn main() {
    // Explicit constructor for the linear engine
    let re = Regex::new_linear(r"a.*b", Flags::default()).unwrap();
    assert!(re.is_match("abbb"));
}

Dynamic Engine Dispatch

You can switch between engines at runtime using AnyRegexEngine. This allows you to choose the best engine for the pattern or use case.

use monster_regex::engine::{
    AnyRegexEngine, RegexEngine, CompiledRegex, 
    backtracking::BacktrackingRegexEngine, 
    linear::LinearRegexEngine
};
use monster_regex::Flags;

fn main() {
    let use_linear = true;
    let flags = Flags::default();
    let pattern = "abc";

    // Type-erased engine trait object
    let engine: Box<dyn RegexEngine<Regex = Box<dyn CompiledRegex>>> = if use_linear {
        Box::new(AnyRegexEngine(LinearRegexEngine))
    } else {
        Box::new(AnyRegexEngine(BacktrackingRegexEngine))
    };

    // Compile returns a Box<dyn CompiledRegex>
    let regex = engine.compile(pattern, flags).unwrap();
    
    assert!(regex.is_match("abc"));
}

Architecture & Traits

monster-regex exposes two key traits for compiled regexes:

CompiledRegex: Object-safe trait containing core methods (is_match, find, captures, replace). Usable with &str. This is the return type when using dynamic dispatch.
CompiledRegexHaystack: Generic trait extending CompiledRegex for streaming support via the Haystack trait. Not object-safe.

When using dynamic dispatch (Box<dyn CompiledRegex>), you are limited to the methods in CompiledRegex (string-based) and cannot use the streaming Haystack API directly on the trait object.
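The split between the two traits follows from Rust's object-safety rules: a trait that declares a generic method cannot be used as a trait object. The sketch below illustrates this with simplified local stand-ins. The names `Hay`, `Compiled`, `CompiledHaystack`, `Literal`, and `erase` are hypothetical and are not the crate's definitions.

```rust
/// Hypothetical stand-in for a haystack abstraction.
trait Hay {
    fn contains_literal(&self, literal: &str) -> bool;
}

impl Hay for &str {
    fn contains_literal(&self, literal: &str) -> bool {
        self.contains(literal)
    }
}

/// Object-safe: every method takes concrete types only.
trait Compiled {
    fn is_match(&self, text: &str) -> bool;
}

/// Not object-safe: `is_match_from` is generic over `H`.
trait CompiledHaystack: Compiled {
    fn is_match_from<H: Hay>(&self, hay: H) -> bool;
}

/// A toy "regex" that only matches a literal substring.
struct Literal(String);

impl Compiled for Literal {
    fn is_match(&self, text: &str) -> bool {
        text.contains(&self.0)
    }
}

impl CompiledHaystack for Literal {
    fn is_match_from<H: Hay>(&self, hay: H) -> bool {
        hay.contains_literal(&self.0)
    }
}

/// Erase the concrete type; only `Compiled` methods remain callable.
fn erase(lit: Literal) -> Box<dyn Compiled> {
    Box::new(lit)
}
```

Calling `is_match_from` works on a concrete `Literal`, while the value returned by `erase` exposes `is_match` only. Writing `Box<dyn CompiledHaystack>` would be rejected by the compiler because of the generic method.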

Using Flags

You can configure behavior using Flags:

use monster_regex::{Regex, Flags};

fn main() {
    let mut flags = Flags::default();
    flags.ignore_case = Some(true); // Case insensitive
    flags.multiline = true;         // ^ and $ match line boundaries

    let re = Regex::new(r"^hello", flags).unwrap();
    assert!(re.is_match("HELLO\nworld"));
}

Parsing Rift Format

You can also parse patterns in the pattern/flags format used by Rift:

use monster_regex::parse_rift_format;
use monster_regex::Regex;

fn main() {
    let (pattern, flags) = parse_rift_format("abc/i").unwrap();
    let re = Regex::new(&pattern, flags).unwrap();

    assert!(re.is_match("ABC"));
}

Find All

use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"\d+", Flags::default()).unwrap();
    let text = "123 abc 456";

    for m in re.find_all(text) {
        println!("Match: {}", &text[m.start..m.end]);
    }
}

Replacement

use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"foo", Flags::default()).unwrap();
    
    // Replace first occurrence only
    let result = re.replace("foo bar foo", "baz");
    assert_eq!(result, "baz bar foo");
    
    // Replace all occurrences
    let result = re.replace_all("foo bar foo", "baz");
    assert_eq!(result, "baz bar baz");
}

Captures Iterator

use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"(\w+)@(\w+)", Flags::default()).unwrap();
    let text = "alice@home bob@work";

    for caps in re.captures_all(text) {
        println!("Full match: {:?}", caps.full_match);
        println!("Groups: {:?}", caps.groups);
    }
}

Inspecting Pattern and Flags

use monster_regex::{Regex, Flags};

fn main() {
    let mut flags = Flags::default();
    flags.ignore_case = Some(true);
    
    let re = Regex::new(r"hello", flags).unwrap();
    
    // Access the original pattern
    assert_eq!(re.pattern(), "hello");
    
    // Access the flags used during compilation
    assert_eq!(re.flags().ignore_case, Some(true));
}

Streaming / Zero-Copy Search

For advanced use cases like searching non-contiguous memory (ropes, gap buffers) without allocation, implement the Haystack trait:

use monster_regex::{Regex, Haystack};

#[derive(Copy, Clone)]
struct MyRope<'a> {
    // ... custom internal structure
    phantom: std::marker::PhantomData<&'a ()>,
}

impl<'a> Haystack for MyRope<'a> {
    // Placeholder bodies: replace each todo!() with logic for your data structure.
    fn len(&self) -> usize { todo!() }
    fn char_at(&self, pos: usize) -> Option<(char, usize)> { todo!() }
    fn char_before(&self, pos: usize) -> Option<char> { todo!() }
    fn matches_range(&self, pos: usize, other_start: usize, other_end: usize) -> bool { todo!() }
    fn starts_with(&self, pos: usize, literal: &str) -> bool { todo!() }
}

fn main() {
    let rope = MyRope { phantom: std::marker::PhantomData };
    let re = Regex::new("pattern", Default::default()).unwrap();

    // Check if pattern matches anywhere
    if re.is_match_from(rope) {
        println!("Found a match!");
    }

    // Find first match
    if let Some(m) = re.find_from(rope) {
         println!("Match at {}-{}", m.start, m.end);
    }
    
    // Find match starting at a specific offset
    if let Some(m) = re.find_from_at(rope, 10) {
        println!("Match starting from offset 10: {}-{}", m.start, m.end);
    }

    // Iterate all matches
    for m in re.find_all_from(rope) {
        // ...
    }
}
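To make the idea of searching non-contiguous memory more concrete, here is a minimal sketch of a two-piece buffer, similar in spirit to a gap buffer. It uses a local trait, `HaystackSketch`, that mirrors three of the method signatures listed above. It does not implement the crate's `Haystack` trait. It assumes that `char_at` returns the character at a byte offset together with its UTF-8 length, which the listing above does not spell out.

```rust
/// Local stand-in mirroring part of the `Haystack` method list.
/// Assumption: `char_at` returns the character at byte offset `pos`
/// together with its UTF-8 length in bytes.
trait HaystackSketch: Copy {
    fn len(&self) -> usize;
    fn char_at(&self, pos: usize) -> Option<(char, usize)>;
    fn starts_with(&self, pos: usize, literal: &str) -> bool;
}

/// Two string pieces viewed as one logical text.
#[derive(Copy, Clone)]
struct TwoChunks<'a> {
    left: &'a str,
    right: &'a str,
}

impl<'a> HaystackSketch for TwoChunks<'a> {
    fn len(&self) -> usize {
        self.left.len() + self.right.len()
    }

    fn char_at(&self, pos: usize) -> Option<(char, usize)> {
        // Pick the chunk that contains `pos` and translate the offset.
        let (chunk, local) = if pos < self.left.len() {
            (self.left, pos)
        } else {
            (self.right, pos - self.left.len())
        };
        // `get` returns None when `local` is out of range or not on a
        // character boundary, so invalid offsets never panic.
        let c = chunk.get(local..)?.chars().next()?;
        Some((c, c.len_utf8()))
    }

    fn starts_with(&self, pos: usize, literal: &str) -> bool {
        // Compare character by character across the chunk seam,
        // without building a contiguous copy of the text.
        let mut at = pos;
        for expected in literal.chars() {
            match self.char_at(at) {
                Some((c, width)) if c == expected => at += width,
                _ => return false,
            }
        }
        true
    }
}
```

With `TwoChunks { left: "hel", right: "lo" }`, a call such as `starts_with(1, "ello")` succeeds even though the literal spans both pieces.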

1. General Syntax

Search patterns are entered in the format: pattern/flags

Pattern: The regex to match.
Flags: Optional single-character flags modifying the search behavior.
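As a rough illustration of the pattern/flags format, the sketch below splits an input at its last `/` and checks the flag letters against the ones listed in the Flags section. The function `split_pattern_flags` is hypothetical. Treating a missing `/` as "no flags" and rejecting unknown letters are assumptions, and the crate's `parse_rift_format` may behave differently.

```rust
/// Flag letters listed in the Flags section of this document.
const KNOWN_FLAGS: &str = "icmsxgu";

/// Split `pattern/flags` at the last `/`.
/// Assumption: a missing `/` means "no flags", and an unknown flag
/// letter is reported as an error carrying that letter.
fn split_pattern_flags(input: &str) -> Result<(&str, &str), char> {
    let (pattern, flags) = match input.rsplit_once('/') {
        Some(parts) => parts,
        None => (input, ""),
    };
    match flags.chars().find(|c| !KNOWN_FLAGS.contains(*c)) {
        Some(bad) => Err(bad),
        None => Ok((pattern, flags)),
    }
}
```

Splitting at the last slash keeps earlier slashes in the pattern, so `a/b/gi` yields the pattern `a/b` with flags `gi`.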

Special Characters

The following characters have special meaning and must be escaped with \ to be matched literally: . * + ? ^ $ | ( ) [ ] { } \

All other characters match themselves literally.

Note on Dot (.): By default, . matches any character except newline. Use the s (dotall) flag to make . match newlines.

Case Sensitivity

Default (Smartcase): Case-insensitive if the pattern contains no uppercase letters. Case-sensitive if the pattern contains at least one uppercase letter.
Overrides: Can be explicitly set using the i (ignore-case) or c (case-sensitive) flags.
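The smartcase rule can be sketched as a small decision function. `effective_ignore_case` is a hypothetical helper, not part of the crate. Its `explicit` parameter models the `i` and `c` flags as `Some(true)` and `Some(false)`. Skipping the character after a backslash, so that classes like `\W` do not count as uppercase letters, is an assumption about how escapes interact with smartcase.

```rust
/// Decide whether matching should ignore case.
/// `explicit` models an `i` (Some(true)) or `c` (Some(false)) flag;
/// `None` falls back to smartcase.
fn effective_ignore_case(pattern: &str, explicit: Option<bool>) -> bool {
    if let Some(choice) = explicit {
        return choice;
    }
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Assumption: the letter after a backslash names a class
            // such as \W or \D and does not count as uppercase.
            chars.next();
        } else if c.is_uppercase() {
            return false;
        }
    }
    true
}
```

Under these assumptions, `hello` is searched case-insensitively, `Hello` case-sensitively, and an explicit flag always wins over the smartcase default.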

2. Quantifiers

Quantifiers specify how many times the preceding atom (character, group, or character class) should match.

Quantifier	Meaning	Greedy?	Example
`*`	0 or more	Yes	`a*` matches "", "a", "aa"...
`+`	1 or more	Yes	`a+` matches "a", "aa"...
`?`	0 or 1	Yes (prefers 1)	`a?` matches "" or "a", preferring "a"
`{n}`	Exactly n	—	`a{3}` matches "aaa"
`{n,m}`	n to m	Yes	`a{2,4}` matches "aa", "aaa", "aaaa"
`{n,}`	n or more	Yes	`a{2,}` matches "aa", "aaa"...
`{,m}`	0 to m	Yes	`a{,3}` matches "", "a", "aa", "aaa"
`*?`	0 or more	No	`a*?` matches minimal characters
`+?`	1 or more	No	`a+?` matches minimal characters
`??`	0 or 1	No	`a??` prefers 0 matches
`{n,m}?`	n to m	No	`a{2,4}?` matches "aa" before "aaa"
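The difference between greedy and lazy quantifiers is the order in which repeat counts are tried. The following sketch handles only the tiny case of one repeated character followed by a literal, anchored at the start of the text. `match_repeat` is a hypothetical helper written for illustration and says nothing about how either engine in the crate is implemented.

```rust
/// Match `ch{min,max}` followed by the literal `rest` at the start of
/// `hay`, returning the end offset of the match.
/// Greedy tries the largest repeat count first; lazy tries the smallest.
fn match_repeat(
    hay: &str,
    ch: char,
    min: usize,
    max: Option<usize>,
    rest: &str,
    greedy: bool,
) -> Option<usize> {
    // Count how many copies of `ch` are available at the start.
    let available = hay.chars().take_while(|&c| c == ch).count();
    let upper = max.map_or(available, |m| m.min(available));
    if upper < min {
        return None;
    }
    let width = ch.len_utf8();
    let mut counts: Vec<usize> = (min..=upper).collect();
    if greedy {
        counts.reverse();
    }
    // The first count whose remainder matches `rest` wins.
    counts
        .into_iter()
        .map(|n| n * width)
        .find(|&end| hay[end..].starts_with(rest))
        .map(|end| end + rest.len())
}
```

On the text `aaaa`, the equivalent of `a{2,4}` ends at offset 4 while `a{2,4}?` ends at offset 2. With a trailing literal, `a*a` gives back one character and still ends at 4, while `a*?a` ends at 1.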

3. Character Classes

Standard Classes

Class	Matches
`\d`	Digit `[0-9]`
`\D`	Non-digit
`\w`	Word character `[a-zA-Z0-9_]` (ASCII by default)
`\W`	Non-word character
`\s`	Whitespace `[ \t\r\n\f\v]`
`\S`	Non-whitespace

Extended Classes

Class	Matches
`\l`	Lowercase character
`\L`	Non-lowercase character
`\u`	Uppercase character
`\U`	Non-uppercase character
`\x`	Hexadecimal digit
`\X`	Non-hexadecimal digit
`\o`	Octal digit
`\O`	Non-octal digit
`\h`	Head of word character (start of a word)
`\H`	Non-head of word character
`\p`	Punctuation ``[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]``
`\P`	Non-punctuation
`\a`	Alphanumeric `[a-zA-Z0-9]`
`\A`	Non-alphanumeric
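For readers who think in terms of Rust's `char` methods, the ASCII forms of the classes in the two tables above map onto simple predicates. The function `class_matches` below is a hypothetical sketch. Reading `\h` as "letter or underscore" is an assumption, since the table only describes it as the head of a word.

```rust
/// Test `c` against the ASCII form of a class letter such as 'd' or 'D'.
/// Returns `None` for letters that are not classes in the tables above.
fn class_matches(class: char, c: char) -> Option<bool> {
    let positive = match class.to_ascii_lowercase() {
        'd' => c.is_ascii_digit(),
        'w' => c.is_ascii_alphanumeric() || c == '_',
        's' => matches!(c, ' ' | '\t' | '\r' | '\n' | '\x0C' | '\x0B'),
        'l' => c.is_ascii_lowercase(),
        'u' => c.is_ascii_uppercase(),
        'x' => c.is_ascii_hexdigit(),
        'o' => matches!(c, '0'..='7'),
        // Assumption: a word can start with a letter or underscore.
        'h' => c.is_ascii_alphabetic() || c == '_',
        'p' => c.is_ascii_punctuation(),
        'a' => c.is_ascii_alphanumeric(),
        _ => return None,
    };
    // An uppercase class letter negates its lowercase counterpart.
    Some(positive != class.is_ascii_uppercase())
}
```

The whitespace arm is spelled out by hand because `char::is_ascii_whitespace` does not include the vertical tab, which the `\s` row above lists.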

Unicode Support

Default: \w, \d, \s, \h match ASCII characters only.
With u flag: These classes include Unicode characters (e.g., \w matches accented characters).

Character Sets

Custom character sets and ranges (e.g., [a-z], [^0-9]) are supported.

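Note on Escaping in Character Classes: Inside a character class, escaping works differently from the rest of the pattern. For example, [\]] matches a literal ], and [a\-z] matches a, -, or z, because the escaped hyphen is a literal character rather than a range operator.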

4. Anchors and Boundaries

Anchors assert a position without matching characters (zero-width).

Anchor	Meaning
`^`	Start of string (or start of line in multiline mode)
`$`	End of string (or end of line in multiline mode)
`\<`	Start of word
`\>`	End of word
`\b`	Word boundary (matches at `\<` or `\>`)
`\zs`	Sets the start of the match (everything before is excluded from the result)
`\ze`	Sets the end of the match (everything after is excluded from the result)

Position Anchors

These anchors match at a specific position in the buffer. They are zero-width assertions and do not consume characters.

Anchor	Meaning	Example
`\%nl`	Matches anywhere on line n (1-indexed).	`\%5lfoo` matches "foo" only if it appears on line 5.
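A line anchor needs to know which line a candidate match starts on. The sketch below shows one way to compute that from a byte offset by counting newlines. `line_of` and `on_line` are hypothetical helpers and do not describe the crate's internals.

```rust
/// Return the 1-indexed line containing byte offset `pos`.
fn line_of(text: &str, pos: usize) -> usize {
    let end = pos.min(text.len());
    1 + text.as_bytes()[..end].iter().filter(|&&b| b == b'\n').count()
}

/// True when a match starting at `start` lies on line `n`, the kind of
/// check an anchor like `\%5l` would need to perform.
fn on_line(text: &str, start: usize, n: usize) -> bool {
    line_of(text, start) == n
}
```

For the text `one\ntwo foo\nthree foo`, the first `foo` starts at offset 8 on line 2 and the second at offset 18 on line 3, so a line-2 anchor would accept only the first.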

Word Boundaries Explained

\<: Matches the position where a word starts (preceded by non-word, followed by word char).
\>: Matches the position where a word ends (preceded by word char, followed by non-word).
\b: Matches at either \< or \>.

Word boundaries \< and \> use the same character definition as \w ([a-zA-Z0-9_]). With the u flag, both adapt to Unicode.
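The three boundary definitions can be written directly as checks on the characters around a position. This is a minimal sketch for the ASCII case. The helper names are hypothetical, and it assumes that the start and end of the text count as "non-word" neighbours.

```rust
/// ASCII word character, the same set as `\w` without the `u` flag.
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Characters immediately before and after byte offset `pos`.
/// `pos` must lie on a character boundary.
fn neighbours(text: &str, pos: usize) -> (Option<char>, Option<char>) {
    (text[..pos].chars().next_back(), text[pos..].chars().next())
}

/// `\<`: not preceded by a word character, followed by one.
fn is_word_start(text: &str, pos: usize) -> bool {
    let (before, after) = neighbours(text, pos);
    !before.map_or(false, is_word) && after.map_or(false, is_word)
}

/// `\>`: preceded by a word character, not followed by one.
fn is_word_end(text: &str, pos: usize) -> bool {
    let (before, after) = neighbours(text, pos);
    before.map_or(false, is_word) && !after.map_or(false, is_word)
}

/// `\b`: either of the two.
fn is_boundary(text: &str, pos: usize) -> bool {
    is_word_start(text, pos) || is_word_end(text, pos)
}
```

In `hi there`, offsets 0 and 3 are word starts, offsets 2 and 8 are word ends, and offset 1 (between `h` and `i`) is not a boundary at all.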

5. Flags

Flags are appended after the pattern delimiter (e.g., pattern/flags).

Flag	Name	Description
`i`	ignore-case	Case-insensitive matching (overrides smartcase).
`c`	case-sensitive	Case-sensitive matching (overrides smartcase).
`m`	multiline	`^` and `$` match line boundaries (`\n`), not just the start/end of the entire buffer.
`s`	dotall	`.` matches newlines (including end-of-line).
`x`	verbose	Whitespace and `#` comments in the pattern are ignored. Literal spaces must be escaped (e.g., `\ ` or `[ ]`).
`g`	global	Match all occurrences (used for find-all or replace operations).
`u`	unicode	Enables Unicode support for character classes (`\w`, `\d`, etc.).

Verbose Mode Examples (x flag):

/foo bar/x matches "foobar" (space is ignored).
/foo\ bar/x matches "foo bar" (space is escaped).
/foo[ ]bar/x matches "foo bar" (space in bracket).
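The three cases above can be reproduced with a small preprocessing pass over the pattern text. `strip_verbose` is a hypothetical sketch of such a pass, not the crate's parser. It assumes that a `#` comment runs to the end of the line and that a bracket expression ends at the first unescaped `]`.

```rust
/// Remove insignificant whitespace and `#` comments from a pattern,
/// keeping escaped characters and everything inside `[...]`.
fn strip_verbose(pattern: &str) -> String {
    let mut out = String::new();
    let mut in_class = false;
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => {
                // Keep the backslash and whatever it escapes.
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            '#' if !in_class => {
                // Assumption: a comment runs to the end of the line.
                for skipped in chars.by_ref() {
                    if skipped == '\n' {
                        break;
                    }
                }
            }
            c if c.is_whitespace() && !in_class => {}
            _ => out.push(c),
        }
    }
    out
}
```

The output is still a pattern, so an escaped space stays as `\ ` for the later parsing stage to interpret as a literal space.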

6. Escape Sequences

Sequence	Matches
`\n`	Newline (LF)
`\t`	Tab
`\r`	Carriage return (CR)
`\f`	Form feed
`\v`	Vertical tab
`\\`	Literal backslash

7. Groups, Alternation, and Assertions

Alternation: pattern1|pattern2 matches either pattern1 or pattern2.
Grouping: (pattern) groups part of the regex and captures it.
Named Capture: (?<name>pattern) captures the group with a specific name.
Non-Capturing Group: (?:pattern) groups without capturing.
Backreferences: \1 through \9 refer to captured groups 1-9. \0 refers to the entire match.

Lookaround Assertions

Lookarounds assert that what follows or precedes the current position matches a pattern, without including it in the match result.

Assertion	Type	Meaning
`(?>=foo)`	Positive Lookahead	Matches if followed by "foo".
`(?>!foo)`	Negative Lookahead	Matches if not followed by "foo".
`(?<=foo)`	Positive Lookbehind	Matches if preceded by "foo".
`(?<!foo)`	Negative Lookbehind	Matches if not preceded by "foo".
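To show what "without including it in the match result" means in practice, the sketch below hand-codes a positive lookahead and a negative lookbehind for plain literals. All three function names are hypothetical. It illustrates the semantics only and is not how the backtracking engine evaluates assertions.

```rust
/// Every (start, end) range where `needle` occurs, overlaps included.
fn candidates<'a>(
    hay: &'a str,
    needle: &'a str,
) -> impl Iterator<Item = (usize, usize)> + 'a {
    hay.char_indices()
        .map(|(start, _)| start)
        .filter(move |&start| hay[start..].starts_with(needle))
        .map(move |start| (start, start + needle.len()))
}

/// Find `needle` only where it is immediately followed by `ahead`.
/// The returned range covers `needle` alone, which is what makes the
/// check a zero-width assertion.
fn find_followed_by(hay: &str, needle: &str, ahead: &str) -> Option<(usize, usize)> {
    candidates(hay, needle).find(|&(_, end)| hay[end..].starts_with(ahead))
}

/// Find `needle` only where it is not immediately preceded by `behind`.
fn find_not_preceded_by(hay: &str, needle: &str, behind: &str) -> Option<(usize, usize)> {
    candidates(hay, needle).find(|&(start, _)| !hay[..start].ends_with(behind))
}
```

For `foobaz foobar`, looking for `foo` followed by `bar` returns the range 7 to 10: the first `foo` is skipped, and `bar` itself is not part of the result.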

Commit count: 49