monster-regex

Crates.io: monster-regex
lib.rs: monster-regex
version: 0.2.2
created_at: 2025-12-31 06:17:15.793536+00
updated_at: 2026-01-09 04:19:58.664197+00
description: A custom regex spec
repository: https://github.com/monster0506/monster-regex
id: 2013988
size: 1,165,053
TJ Raklovits (Monster0506)

documentation

https://docs.rs/monster-regex

README

Rift Search Specification

This document outlines the regular expression syntax and features supported by Rift's search engine.

Usage

Add monster-regex to your Cargo.toml:

[dependencies]
monster-regex = "0.2.2"

Basic Example (Backtracking Engine)

By default, Regex::new uses the BacktrackingRegexEngine. This engine supports advanced features like lookarounds and backreferences but may have exponential runtime on pathological patterns.

use monster_regex::{Regex, Flags};

fn main() {
    // Compile using the default backtracking engine
    let re = Regex::new(r"\w+", Flags::default()).unwrap();

    assert!(re.is_match("hello"));
    
    // Find a match
    if let Some(m) = re.find("hello world") {
        println!("Found match at {}-{}", m.start, m.end); // 0-5
    }
}

Linear Engine (O(n))

For performance-critical code where O(n) guarantees are required, use the LinearRegexEngine (based on PikeVM). Note that this engine does not support lookarounds or backreferences.

use monster_regex::{Regex, Flags};

fn main() {
    // Explicit constructor for the linear engine
    let re = Regex::new_linear(r"a.*b", Flags::default()).unwrap();
    assert!(re.is_match("abbb"));
}

Dynamic Engine Dispatch

You can switch between engines at runtime using AnyRegexEngine. This allows you to choose the best engine for the pattern or use case.

use monster_regex::engine::{
    AnyRegexEngine, RegexEngine, CompiledRegex, 
    backtracking::BacktrackingRegexEngine, 
    linear::LinearRegexEngine
};
use monster_regex::Flags;

fn main() {
    let use_linear = true;
    let flags = Flags::default();
    let pattern = "abc";

    // Type-erased engine trait object
    let engine: Box<dyn RegexEngine<Regex = Box<dyn CompiledRegex>>> = if use_linear {
        Box::new(AnyRegexEngine(LinearRegexEngine))
    } else {
        Box::new(AnyRegexEngine(BacktrackingRegexEngine))
    };

    // Compile returns a Box<dyn CompiledRegex>
    let regex = engine.compile(pattern, flags).unwrap();
    
    assert!(regex.is_match("abc"));
}

Architecture & Traits

monster-regex exposes two key traits for compiled regexes:

  1. CompiledRegex: Object-safe trait containing core methods (is_match, find, captures, replace). Usable with &str. This is the return type when using dynamic dispatch.
  2. CompiledRegexHaystack: Generic trait extending CompiledRegex for streaming support via the Haystack trait. Not object-safe.

When using dynamic dispatch (Box<dyn CompiledRegex>), you are limited to the methods in CompiledRegex (string-based) and cannot use the streaming Haystack API directly on the trait object.
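The split between the two traits follows from Rust's rules for trait objects: a trait with generic methods cannot be used as dyn Trait. The sketch below illustrates that with stand-in traits (Compiled, CompiledHay, Hay) and stand-in types (Literal, Chunks). All of these names are hypothetical and are not the crate's real definitions.

```rust
// Stand-in traits for illustration only; NOT the crate's real definitions.

/// Object-safe: every method uses concrete types, so `dyn Compiled` is allowed.
trait Compiled {
    fn is_match(&self, text: &str) -> bool;
}

/// Minimal stand-in for a haystack abstraction.
trait Hay {
    fn contains_literal(&self, needle: &str) -> bool;
}

/// Not object-safe: a generic method needs one instantiation per `H`,
/// which a vtable cannot provide.
trait CompiledHay: Compiled {
    fn is_match_from<H: Hay>(&self, hay: H) -> bool;
}

/// A "regex" that only matches a literal substring.
struct Literal(String);

impl Compiled for Literal {
    fn is_match(&self, text: &str) -> bool {
        text.contains(&self.0)
    }
}

impl CompiledHay for Literal {
    fn is_match_from<H: Hay>(&self, hay: H) -> bool {
        hay.contains_literal(&self.0)
    }
}

/// Text stored as separate chunks (matches spanning two chunks are ignored here).
struct Chunks<'a>(Vec<&'a str>);

impl<'a> Hay for Chunks<'a> {
    fn contains_literal(&self, needle: &str) -> bool {
        self.0.iter().any(|chunk| chunk.contains(needle))
    }
}

/// Type erasure works for the object-safe trait only.
fn boxed(pattern: &str) -> Box<dyn Compiled> {
    Box::new(Literal(pattern.to_string()))
}
```

With these definitions, boxed("abc").is_match("xxabcxx") goes through the trait object, while is_match_from is only callable on the concrete Literal type.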

Using Flags

You can configure behavior using Flags:

use monster_regex::{Regex, Flags};

fn main() {
    let mut flags = Flags::default();
    flags.ignore_case = Some(true); // Case insensitive
    flags.multiline = true;         // ^ and $ match line boundaries

    let re = Regex::new(r"^hello", flags).unwrap();
    assert!(re.is_match("HELLO\nworld"));
}

Parsing Rift Format

You can also parse patterns in the pattern/flags format used by Rift:

use monster_regex::parse_rift_format;
use monster_regex::Regex;

fn main() {
    let (pattern, flags) = parse_rift_format("abc/i").unwrap();
    let re = Regex::new(&pattern, flags).unwrap();

    assert!(re.is_match("ABC"));
}

Find All

use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"\d+", Flags::default()).unwrap();
    let text = "123 abc 456";

    for m in re.find_all(text) {
        println!("Match: {}", &text[m.start..m.end]);
    }
}

Replacement

use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"foo", Flags::default()).unwrap();
    
    // Replace first occurrence only
    let result = re.replace("foo bar foo", "baz");
    assert_eq!(result, "baz bar foo");
    
    // Replace all occurrences
    let result = re.replace_all("foo bar foo", "baz");
    assert_eq!(result, "baz bar baz");
}

Captures Iterator

use monster_regex::{Regex, Flags};

fn main() {
    let re = Regex::new(r"(\w+)@(\w+)", Flags::default()).unwrap();
    let text = "alice@home bob@work";

    for caps in re.captures_all(text) {
        println!("Full match: {:?}", caps.full_match);
        println!("Groups: {:?}", caps.groups);
    }
}

Inspecting Pattern and Flags

use monster_regex::{Regex, Flags};

fn main() {
    let mut flags = Flags::default();
    flags.ignore_case = Some(true);
    
    let re = Regex::new(r"hello", flags).unwrap();
    
    // Access the original pattern
    assert_eq!(re.pattern(), "hello");
    
    // Access the flags used during compilation
    assert_eq!(re.flags().ignore_case, Some(true));
}

Streaming / Zero-Copy Search

For advanced use cases like searching non-contiguous memory (ropes, gap buffers) without allocation, implement the Haystack trait:

use monster_regex::{Regex, Haystack};

#[derive(Copy, Clone)]
struct MyRope<'a> {
    // ... custom internal structure
    phantom: std::marker::PhantomData<&'a ()>,
}

impl<'a> Haystack for MyRope<'a> {
    fn len(&self) -> usize { /* ... */ }
    fn char_at(&self, pos: usize) -> Option<(char, usize)> { /* ... */ }
    fn char_before(&self, pos: usize) -> Option<char> { /* ... */ }
    fn matches_range(&self, pos: usize, other_start: usize, other_end: usize) -> bool { /* ... */ }
    fn starts_with(&self, pos: usize, literal: &str) -> bool { /* ... */ }
}

fn main() {
    let rope = MyRope { /* ... */ };
    let re = Regex::new("pattern", Default::default()).unwrap();

    // Check if pattern matches anywhere
    if re.is_match_from(rope) {
        println!("Found a match!");
    }

    // Find first match
    if let Some(m) = re.find_from(rope) {
         println!("Match at {}-{}", m.start, m.end);
    }
    
    // Find match starting at a specific offset
    if let Some(m) = re.find_from_at(rope, 10) {
        println!("Match starting from offset 10: {}-{}", m.start, m.end);
    }

    // Iterate all matches
    for m in re.find_all_from(rope) {
        // ...
    }
}
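To make the placeholder bodies above more concrete, here is a minimal sketch of three of the methods for a two-piece buffer. It uses a local stand-in trait (HaystackSketch) rather than the real monster_regex::Haystack, and it assumes that positions are byte offsets and that char_at returns the character together with its UTF-8 length. Neither assumption is confirmed by this README.

```rust
// Local stand-in for part of the trait shown above. The semantics of the real
// `monster_regex::Haystack` methods are assumed here, not confirmed.
trait HaystackSketch {
    /// Total length in bytes.
    fn len(&self) -> usize;
    /// Character starting at byte `pos`, plus its UTF-8 length (assumed meaning).
    fn char_at(&self, pos: usize) -> Option<(char, usize)>;
    /// Character ending just before byte `pos`.
    fn char_before(&self, pos: usize) -> Option<char>;
}

/// Two borrowed pieces treated as one logical text, similar to a gap buffer.
#[derive(Copy, Clone)]
struct TwoPiece<'a> {
    left: &'a str,
    right: &'a str,
}

impl<'a> TwoPiece<'a> {
    /// Map a logical byte offset to (piece, offset inside that piece).
    fn locate(&self, pos: usize) -> (&'a str, usize) {
        if pos < self.left.len() {
            (self.left, pos)
        } else {
            (self.right, pos - self.left.len())
        }
    }
}

impl<'a> HaystackSketch for TwoPiece<'a> {
    fn len(&self) -> usize {
        self.left.len() + self.right.len()
    }

    fn char_at(&self, pos: usize) -> Option<(char, usize)> {
        let (piece, local) = self.locate(pos);
        // `get` returns None when `local` is out of range or not on a char boundary.
        let c = piece.get(local..)?.chars().next()?;
        Some((c, c.len_utf8()))
    }

    fn char_before(&self, pos: usize) -> Option<char> {
        if pos == 0 || pos > self.len() {
            return None;
        }
        if pos <= self.left.len() {
            self.left.get(..pos)?.chars().next_back()
        } else {
            self.right.get(..pos - self.left.len())?.chars().next_back()
        }
    }
}
```

No text is copied: both pieces stay borrowed, and every lookup is translated to an offset inside one of them.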

1. General Syntax

Search patterns are entered in the format: pattern/flags

  • Pattern: The regex to match.
  • Flags: Optional single-character flags modifying the search behavior.
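As a rough illustration of the pattern/flags shape, the sketch below splits an input at its last /. The function name split_rift is hypothetical, and the real parse_rift_format may behave differently, for example with escaped slashes or with patterns that contain / but carry no flags.

```rust
/// Split `pattern/flags` at the last `/`.
/// Assumption: the last slash is the delimiter; with no slash, there are no flags.
fn split_rift(input: &str) -> (&str, &str) {
    match input.rfind('/') {
        Some(idx) => (&input[..idx], &input[idx + 1..]),
        None => (input, ""),
    }
}
```

Under this assumption, "abc/i" yields the pattern abc with the flag string i.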

Special Characters

The following characters have special meaning and must be escaped with \ to be matched literally: . * + ? ^ $ | ( ) [ ] { } \

All other characters match themselves literally.

Note on Dot (.): By default, . matches any character except newline. Use the s (dotall) flag to make . match newlines.

Case Sensitivity

  • Default (Smartcase): Case-insensitive if the pattern contains no uppercase letters; case-sensitive if it contains at least one.
  • Overrides: Can be explicitly set using the i (ignore-case) or c (case-sensitive) flags.
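The smartcase rule and its overrides can be sketched as a single decision function. This is a simplified reading of the rules above, not the crate's implementation: the name effective_ignore_case is hypothetical, and uppercase letters inside escapes such as \W are not special-cased here, whereas the real engine may treat them differently.

```rust
/// Decide whether matching should ignore case.
/// `override_flag`: Some(true) for `i`, Some(false) for `c`, None for smartcase.
fn effective_ignore_case(pattern: &str, override_flag: Option<bool>) -> bool {
    match override_flag {
        // An explicit flag always wins.
        Some(explicit) => explicit,
        // Smartcase: any uppercase letter makes the search case-sensitive.
        // Simplification: escapes such as `\W` are not special-cased.
        None => !pattern.chars().any(|c| c.is_uppercase()),
    }
}
```

The Option<bool> parameter mirrors the flags.ignore_case = Some(true) usage shown earlier, with None standing for "no override".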

2. Quantifiers

Quantifiers specify how many times the preceding atom (character, group, or character class) should match.

| Quantifier | Meaning | Greedy? | Example |
| --- | --- | --- | --- |
| * | 0 or more | Yes | a* matches "", "a", "aa"... |
| + | 1 or more | Yes | a+ matches "a", "aa"... |
| ? | 0 or 1 | Yes (prefers 1) | a? matches "" or "a", preferring "a" |
| {n} | Exactly n | n/a | a{3} matches "aaa" |
| {n,m} | n to m | Yes | a{2,4} matches "aa", "aaa", "aaaa" |
| {n,} | n or more | Yes | a{2,} matches "aa", "aaa"... |
| {,m} | 0 to m | Yes | a{,3} matches "", "a", "aa", "aaa" |
| *? | 0 or more | No | a*? matches minimal characters |
| +? | 1 or more | No | a+? matches minimal characters |
| ?? | 0 or 1 | No | a?? prefers 0 matches |
| {n,m}? | n to m | No | a{2,4}? matches "aa" before "aaa" |
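The difference between greedy and lazy quantifiers is only the order in which repetition counts are tried. The sketch below shows this for a single character repeated {min,max} times and followed by a literal suffix. The function repeat_then is a hypothetical helper written for this illustration and is unrelated to the crate's engines.

```rust
/// Match `c{min,max}` followed by the literal `suffix` at the start of `text`.
/// Returns how many repetitions of `c` were used, or None if no count works.
fn repeat_then(
    text: &str,
    c: char,
    min: usize,
    max: usize,
    greedy: bool,
    suffix: &str,
) -> Option<usize> {
    // Number of leading `c` characters available, capped at `max`.
    let available = text.chars().take_while(|&ch| ch == c).take(max).count();
    if available < min {
        return None;
    }
    // Does the rest of the pattern match after consuming `n` repetitions?
    let works = |n: usize| text[n * c.len_utf8()..].starts_with(suffix);
    if greedy {
        // Greedy: try the largest count first, then back off.
        (min..=available).rev().find(|&n| works(n))
    } else {
        // Lazy: try the smallest count first, then extend.
        (min..=available).find(|&n| works(n))
    }
}
```

With an empty suffix on "aaaab", the greedy form of a{2,4} takes four characters and the lazy form takes two. When the suffix is "ab", both settle on three, because that is the only count that lets the rest of the pattern match.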

3. Character Classes

Standard Classes

| Class | Matches |
| --- | --- |
| \d | Digit [0-9] |
| \D | Non-digit |
| \w | Word character [a-zA-Z0-9_] (ASCII by default) |
| \W | Non-word character |
| \s | Whitespace [ \t\r\n\f\v] |
| \S | Non-whitespace |

Extended Classes

| Class | Matches |
| --- | --- |
| \l | Lowercase character |
| \L | Non-lowercase character |
| \u | Uppercase character |
| \U | Non-uppercase character |
| \x | Hexadecimal digit |
| \X | Non-hexadecimal digit |
| \o | Octal digit |
| \O | Non-octal digit |
| \h | Head of word character (start of a word) |
| \H | Non-head of word character |
| \p | Punctuation [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_{ |
| \P | Non-punctuation |
| \a | Alphanumeric [a-zA-Z0-9] |
| \A | Non-alphanumeric |
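The class tables follow one pattern: a lowercase escape letter names a set, and the uppercase letter names its complement. A minimal ASCII-only sketch of that rule is shown below. The function class_matches is hypothetical, and it deliberately leaves out \h and \p, whose exact member sets are not fully spelled out above.

```rust
/// ASCII-only membership test for a subset of the class escapes.
/// `escape` is the letter after the backslash, e.g. 'd' for `\d`.
/// Returns None for letters this sketch does not cover (such as 'h' and 'p').
fn class_matches(escape: char, c: char) -> Option<bool> {
    let positive = match escape.to_ascii_lowercase() {
        'd' => c.is_ascii_digit(),
        'w' => c.is_ascii_alphanumeric() || c == '_',
        // Space, tab, CR, LF, form feed (0x0C), vertical tab (0x0B).
        's' => matches!(c, ' ' | '\t' | '\r' | '\n' | '\x0C' | '\x0B'),
        'l' => c.is_ascii_lowercase(),
        'u' => c.is_ascii_uppercase(),
        'x' => c.is_ascii_hexdigit(),
        'o' => ('0'..='7').contains(&c),
        'a' => c.is_ascii_alphanumeric(),
        _ => return None,
    };
    // An uppercase escape letter negates the class.
    Some(if escape.is_ascii_uppercase() { !positive } else { positive })
}
```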

Unicode Support

  • Default: \w, \d, \s, \h match ASCII characters only.
  • With u flag: These classes include Unicode characters (e.g., \w matches accented characters).

Character Sets

Custom character sets and ranges (e.g., [a-z], [^0-9]) are supported.

Note on Escaping in Character Classes: Escaping rules differ inside character classes. For example, [\]] matches a literal ], and [a\-z] matches a, \, or -.

4. Anchors and Boundaries

Anchors assert a position without matching characters (zero-width).

| Anchor | Meaning |
| --- | --- |
| ^ | Start of string (or start of line in multiline mode) |
| $ | End of string (or end of line in multiline mode) |
| \< | Start of word |
| \> | End of word |
| \b | Word boundary (matches at \< or \>) |
| \zs | Sets the start of the match (everything before is excluded from the result) |
| \ze | Sets the end of the match (everything after is excluded from the result) |

Position Anchors

These anchors match at a specific position in the buffer. They are zero-width assertions and do not consume characters.

| Anchor | Meaning | Example |
| --- | --- | --- |
| \%nl | Matches anywhere on line n (1-indexed). | \%5lfoo matches "foo" only if it appears on line 5. |
| \%nc | Matches at column n (1-indexed). | \%5cfoo matches "foo" starting at column 5. |
| \%# | Matches at the current cursor position. | \%#foo matches "foo" starting exactly under the cursor. |

Note: \%nl is not implemented in the parser; clients must handle line-based matching.
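Since line-based matching is left to the client, one possible approach is to run the pattern without the line anchor and then filter the matches by position. The sketch below converts a byte offset to a 1-indexed line and column and keeps only the spans that start on a given line. The helpers line_col and on_line are hypothetical, and counting columns in characters (rather than bytes) is an assumption.

```rust
/// 1-indexed (line, column) of byte offset `offset` in `text`.
/// Columns are counted in characters here; that choice is an assumption.
/// Panics if `offset` is out of range or not on a character boundary.
fn line_col(text: &str, offset: usize) -> (usize, usize) {
    let before = &text[..offset];
    let line = before.matches('\n').count() + 1;
    // Byte index where the current line begins.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let col = before[line_start..].chars().count() + 1;
    (line, col)
}

/// Keep only the (start, end) byte spans that begin on `wanted_line`.
fn on_line(text: &str, spans: &[(usize, usize)], wanted_line: usize) -> Vec<(usize, usize)> {
    spans
        .iter()
        .copied()
        .filter(|&(start, _)| line_col(text, start).0 == wanted_line)
        .collect()
}
```

The spans would come from the start and end fields of the matches returned by a find-all call.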

Word Boundaries Explained

  • \<: Matches the position where a word starts (preceded by non-word, followed by word char).
  • \>: Matches the position where a word ends (preceded by word char, followed by non-word).
  • \b: Matches at either \< or \>.

Word boundaries \< and \> use the same character definition as \w ([a-zA-Z0-9_]). With the u flag, both adapt to Unicode.
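The three boundary definitions can be written directly as predicates over the characters on either side of a position. This is a minimal ASCII-only sketch with hypothetical helper names. It assumes that the start and end of the text count as non-word positions, which the description above does not state explicitly.

```rust
/// ASCII word character, matching the default definition of `\w`.
fn is_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Word-ness of the characters just before and just after byte offset `pos`.
/// The text edges count as non-word (an assumption).
/// Panics if `pos` is out of range or not on a character boundary.
fn neighbors(text: &str, pos: usize) -> (bool, bool) {
    let before = text[..pos].chars().next_back().map_or(false, is_word);
    let after = text[pos..].chars().next().map_or(false, is_word);
    (before, after)
}

/// `\<`: preceded by non-word, followed by a word character.
fn at_word_start(text: &str, pos: usize) -> bool {
    let (before, after) = neighbors(text, pos);
    !before && after
}

/// `\>`: preceded by a word character, followed by non-word.
fn at_word_end(text: &str, pos: usize) -> bool {
    let (before, after) = neighbors(text, pos);
    before && !after
}

/// `\b`: either of the two.
fn at_boundary(text: &str, pos: usize) -> bool {
    at_word_start(text, pos) || at_word_end(text, pos)
}
```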

5. Flags

Flags are appended after the pattern delimiter (e.g., pattern/flags).

| Flag | Name | Description |
| --- | --- | --- |
| i | ignore-case | Case-insensitive matching (overrides smartcase). |
| c | case-sensitive | Case-sensitive matching (overrides smartcase). |
| m | multiline | ^ and $ match line boundaries (\n), not just the start/end of the entire buffer. |
| s | dotall | . matches newlines (including end-of-line). |
| x | verbose | Whitespace and # comments in the pattern are ignored. Literal spaces must be escaped (e.g., a backslash followed by a space, or [ ]). |
| g | global | Match all occurrences (used for find-all or replace operations). |
| u | unicode | Enables Unicode support for character classes (\w, \d, etc.). |
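The flag letters in the table map naturally onto a small settings struct. The sketch below uses a hypothetical FlagSketch type. Only ignore_case (as Option<bool>) and multiline are modeled on fields shown in the usage examples; the other field names, and the choice to reject unknown letters, are assumptions.

```rust
/// Hypothetical stand-in for a flags struct; not the crate's `Flags` type.
#[derive(Debug, Default, PartialEq)]
struct FlagSketch {
    /// Some(true) = `i`, Some(false) = `c`, None = smartcase.
    ignore_case: Option<bool>,
    multiline: bool,
    dotall: bool,
    verbose: bool,
    global: bool,
    unicode: bool,
}

/// Parse a string of single-character flags.
/// Returns the first unknown letter as an error (an assumed policy).
fn parse_flags(flags: &str) -> Result<FlagSketch, char> {
    let mut out = FlagSketch::default();
    for f in flags.chars() {
        match f {
            'i' => out.ignore_case = Some(true),
            'c' => out.ignore_case = Some(false),
            'm' => out.multiline = true,
            's' => out.dotall = true,
            'x' => out.verbose = true,
            'g' => out.global = true,
            'u' => out.unicode = true,
            other => return Err(other),
        }
    }
    Ok(out)
}
```

If both i and c appear, the later one wins in this sketch. How the real parser resolves that conflict is not specified here.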

Verbose Mode Examples (x flag):

  • /foo bar/x matches "foobar" (space is ignored).
  • /foo\ bar/x matches "foo bar" (space is escaped).
  • /foo[ ]bar/x matches "foo bar" (space in bracket).
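One way to picture verbose mode is as a preprocessing pass that removes insignificant whitespace and comments before parsing. The sketch below implements that pass under the rules listed above. The function strip_verbose is hypothetical, and the real parser may handle verbose mode during tokenization instead.

```rust
/// Remove insignificant whitespace and `#` comments, as the `x` flag describes.
/// Escaped characters and the contents of `[...]` are kept untouched.
fn strip_verbose(pattern: &str) -> String {
    let mut out = String::new();
    let mut chars = pattern.chars();
    let mut in_class = false;
    while let Some(c) = chars.next() {
        match c {
            // Keep an escape together with the character it escapes.
            '\\' => {
                out.push(c);
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => {
                in_class = true;
                out.push(c);
            }
            ']' if in_class => {
                in_class = false;
                out.push(c);
            }
            // Outside a class, a comment runs to the end of the line.
            '#' if !in_class => {
                for skipped in chars.by_ref() {
                    if skipped == '\n' {
                        break;
                    }
                }
            }
            // Outside a class, unescaped whitespace is dropped.
            c if c.is_whitespace() && !in_class => {}
            _ => out.push(c),
        }
    }
    out
}
```

Applied to the three patterns above, this pass yields foobar, foo\ bar, and foo[ ]bar respectively.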

6. Escape Sequences

| Sequence | Matches |
| --- | --- |
| \n | Newline (LF) |
| \t | Tab |
| \r | Carriage return (CR) |
| \f | Form feed |
| \v | Vertical tab |
| \\ | Literal backslash |

7. Groups, Alternation, and Assertions

  • Alternation: pattern1|pattern2 matches either pattern1 or pattern2.
  • Grouping: (pattern) groups part of the regex and captures it.
  • Named Capture: (?<name>pattern) captures the group with a specific name.
  • Non-Capturing Group: (?:pattern) groups without capturing.
  • Backreferences: \1 through \9 refer to captured groups 1-9. \0 refers to the entire match.

Lookaround Assertions

Lookarounds assert that what follows or precedes the current position matches a pattern, without including it in the match result.

| Assertion | Type | Meaning |
| --- | --- | --- |
| (?>=foo) | Positive Lookahead | Matches if followed by "foo". |
| (?>!foo) | Negative Lookahead | Matches if not followed by "foo". |
| (?<=foo) | Positive Lookbehind | Matches if preceded by "foo". |
| (?<!foo) | Negative Lookbehind | Matches if not preceded by "foo". |
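The key property of a lookaround is that the asserted text is checked but never becomes part of the reported span. The sketch below demonstrates this for literal strings only, using the standard library instead of a regex engine. The function find_with_lookahead is a hypothetical helper and does not reflect how either engine implements assertions.

```rust
/// Byte spans of the literal `word` filtered by what follows it.
/// `positive = true` keeps occurrences followed by `ahead` (positive lookahead);
/// `positive = false` keeps those NOT followed by it (negative lookahead).
/// The span covers `word` only: the lookahead text is checked, not consumed.
fn find_with_lookahead(
    text: &str,
    word: &str,
    ahead: &str,
    positive: bool,
) -> Vec<(usize, usize)> {
    text.match_indices(word)
        .filter(|&(start, _)| text[start + word.len()..].starts_with(ahead) == positive)
        .map(|(start, _)| (start, start + word.len()))
        .collect()
}
```

On "foobar foobaz", searching for foo followed by bar reports only the span of the first foo, and the negative form reports only the second.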