| Crates.io | regexr |
| lib.rs | regexr |
| version | 0.1.0-beta.5 |
| created_at | 2025-11-28 07:33:55.223366+00 |
| updated_at | 2025-12-02 22:16:35.724866+00 |
| description | A high-performance regex engine built from scratch with JIT compilation and SIMD acceleration |
| homepage | |
| repository | https://github.com/farhan-syah/regexr |
| max_upload_size | |
| id | 1954940 |
| size | 3,386,780 |

A specialized, pure-Rust regex engine designed for LLM tokenization and complex pattern matching.
⚠️ Experimental - API May Change
This library was created as the regex backend for splintr, an LLM tokenizer. It is highly experimental and the API may change drastically between versions.
While it passes compliance tests for industry-standard tokenizer patterns (OpenAI's cl100k_base, Meta's Llama 3), it has not been proven in production environments.
Recommended for: Research, experimentation, tokenizer development, data preprocessing.
Not recommended for: Production systems requiring stability guarantees.
Please report issues on the Issue Tracker.
regexr is a specialized tool, not a general-purpose replacement.
The Rust ecosystem already has the excellent, battle-tested regex crate. For 99% of use cases, you should use that.
Only use regexr if you specifically need:
- Lookarounds such as (?=...), (?<=...), or (?!\S) without C dependencies.
  - regex? It intentionally omits lookarounds to guarantee linear time.
  - pcre2? Requires a C library and FFI.
- JIT compilation.
  - regex/fancy-regex? Neither offers JIT compilation.
  - pcre2? Requires a C library and FFI.
- A pure-Rust alternative where pcre2 is ruled out due to unsafe C bindings or build complexity.
- … pcre2).

Developers building LLM tokenizers (such as those for GPT-4 or Llama 3) currently face a dilemma in Rust:
- regex crate: Fast and safe, but lacks lookarounds and JIT compilation.
- fancy-regex: Supports lookarounds, but lacks JIT compilation.
- pcre2: Supports everything including JIT, but introduces unsafe C bindings and external dependencies.

regexr bridges this gap. It provides lookarounds, JIT compilation, and backreferences while remaining 100% pure Rust.
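To make the lookaround requirement concrete: the cl100k_base pattern contains the alternative `\s+(?!\S)`, which matches a run of whitespace only up to a point that is not followed by a non-whitespace character. The sketch below emulates that single alternative by hand using only the standard library. It is an illustration of the semantics, not regexr's implementation; the function name `match_ws_not_before_nonws` is hypothetical, and `start` is assumed to be a valid char boundary within `text`.

```rust
/// Hand-rolled emulation of `\s+(?!\S)` anchored at byte offset `start`.
/// Hypothetical helper for illustration; not part of regexr.
/// Returns the byte range of the match, or `None` if there is no match.
fn match_ws_not_before_nonws(text: &str, start: usize) -> Option<(usize, usize)> {
    // Greedily consume the maximal whitespace run beginning at `start`.
    let mut end = start;
    let mut last_char_start = start;
    for (offset, ch) in text[start..].char_indices() {
        if !ch.is_whitespace() {
            break;
        }
        last_char_start = start + offset;
        end = last_char_start + ch.len_utf8();
    }
    if end == start {
        return None; // `\s+` needs at least one whitespace character
    }
    if end == text.len() {
        return Some((start, end)); // nothing follows, so `(?!\S)` holds
    }
    // A non-whitespace character follows the run, so the lookahead fails at `end`.
    // Backtrack by one character: that position is followed by whitespace.
    if last_char_start > start {
        Some((start, last_char_start))
    } else {
        None // a single whitespace character directly before a word
    }
}

fn main() {
    // Three spaces before "world": the match keeps two and leaves one.
    assert_eq!(match_ws_not_before_nonws("hello   world", 5), Some((5, 7)));
    // Trailing whitespace matches in full.
    assert_eq!(match_ws_not_before_nonws("hello  ", 5), Some((5, 7)));
    // A single space before a word does not match this alternative.
    assert_eq!(match_ws_not_before_nonws("hello world", 5), None);
}
```

The effect in a tokenizer is that the last space before a word is left over so it can be attached to the following token. An engine without lookahead cannot express this directly.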
Add this to your Cargo.toml:
[dependencies]
regexr = "0.x"
For JIT compilation support:
[dependencies]
regexr = { version = "0.x", features = ["full"] }
use regexr::Regex;
let re = Regex::new(r"\w+").unwrap();
assert!(re.is_match("hello"));
// Find first match
if let Some(m) = re.find("hello world") {
    println!("Found: {}", m.as_str()); // "hello"
}
// Find all matches
for m in re.find_iter("hello world") {
    println!("{}", m.as_str());
}
use regexr::Regex;
let re = Regex::new(r"(\w+)@(\w+)\.(\w+)").unwrap();
let caps = re.captures("user@example.com").unwrap();
println!("{}", &caps[0]); // "user@example.com"
println!("{}", &caps[1]); // "user"
println!("{}", &caps[2]); // "example"
println!("{}", &caps[3]); // "com"
use regexr::Regex;
let re = Regex::new(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)").unwrap();
let caps = re.captures("user@example.com").unwrap();
println!("{}", &caps["user"]); // "user"
println!("{}", &caps["domain"]); // "example.com"
Enable JIT for patterns that will be matched many times:
use regexr::RegexBuilder;
let re = RegexBuilder::new(r"\w+")
    .jit(true)
    .build()
    .unwrap();
assert!(re.is_match("hello"));
For patterns with many literal alternatives (e.g., keyword matching in tokenizers):
use regexr::RegexBuilder;
let re = RegexBuilder::new(r"(function|for|while|if|else|return)")
    .optimize_prefixes(true)
    .build()
    .unwrap();
assert!(re.is_match("function"));
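As a rough intuition for why shared prefixes matter with many literal alternatives, the sketch below builds a byte-level trie over the keywords so that `function` and `for` share the path `f`, and the input is scanned once rather than once per alternative. This is a general technique shown under assumptions; the source does not describe what optimize_prefixes does internally, and `LiteralTrie` and its methods are hypothetical names, not regexr APIs.

```rust
use std::collections::HashMap;

/// Byte-level trie over literal alternatives (hypothetical helper, not regexr's API).
struct LiteralTrie {
    /// children[i] maps a byte to the index of the child node of node i.
    children: Vec<HashMap<u8, usize>>,
    /// terminal[i] is true when some keyword ends at node i.
    terminal: Vec<bool>,
}

impl LiteralTrie {
    fn new(words: &[&str]) -> Self {
        let mut trie = LiteralTrie {
            children: vec![HashMap::new()],
            terminal: vec![false],
        };
        for word in words {
            let mut node = 0;
            for &b in word.as_bytes() {
                node = match trie.children[node].get(&b).copied() {
                    Some(next) => next, // shared prefix: reuse the existing node
                    None => {
                        let next = trie.children.len();
                        trie.children.push(HashMap::new());
                        trie.terminal.push(false);
                        trie.children[node].insert(b, next);
                        next
                    }
                };
            }
            trie.terminal[node] = true;
        }
        trie
    }

    /// Length in bytes of the longest keyword that is a prefix of `input`.
    fn longest_prefix_match(&self, input: &str) -> Option<usize> {
        let mut node = 0;
        let mut best = None;
        for (i, b) in input.bytes().enumerate() {
            match self.children[node].get(&b) {
                Some(&next) => node = next,
                None => break,
            }
            if self.terminal[node] {
                best = Some(i + 1);
            }
        }
        best
    }
}

fn main() {
    let trie = LiteralTrie::new(&["function", "for", "while", "if", "else", "return"]);
    assert_eq!(trie.longest_prefix_match("function foo"), Some(8));
    assert_eq!(trie.longest_prefix_match("fork"), Some(3));
    assert_eq!(trie.longest_prefix_match("foo"), None);
}
```

Note that this sketch reports the longest keyword, whereas a backtracking regex alternation usually prefers the first alternative that matches, so a real engine has to preserve that ordering.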
use regexr::Regex;
let re = Regex::new(r"\d+").unwrap();
// Replace first match
let result = re.replace("abc 123 def", "NUM");
assert_eq!(result, "abc NUM def");
// Replace all matches
let result = re.replace_all("abc 123 def 456", "NUM");
assert_eq!(result, "abc NUM def NUM");
- simd (default): Enables SIMD-accelerated literal search
- jit: Enables JIT compilation (x86-64 and ARM64)
- full: Enables both JIT and SIMD

| Platform | JIT Support | SIMD Support |
|---|---|---|
| Linux x86-64 | ✓ | ✓ (AVX2) |
| Linux ARM64 | ✓ | ✗ |
| macOS x86-64 | ✓ | ✓ (AVX2) |
| macOS ARM64 (Apple Silicon) | ✓ | ✗ |
| Windows x86-64 | ✓ | ✓ (AVX2) |
| Other | ✗ | ✗ |
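For readers who want to check where their own machine falls in the table, the following standard-library sketch reports the target architecture at compile time and probes for AVX2 at run time. It only mirrors the table's architecture columns and says nothing about how regexr performs its own detection; `avx2_available` and `jit_arch_supported` are hypothetical helper names.

```rust
/// True when the running CPU reports AVX2 support (x86-64 builds only).
#[cfg(target_arch = "x86_64")]
fn avx2_available() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On every other architecture, AVX2 does not exist.
#[cfg(not(target_arch = "x86_64"))]
fn avx2_available() -> bool {
    false
}

/// True when the build targets one of the architectures the table lists
/// with JIT support (x86-64 or ARM64). The OS is not checked here.
fn jit_arch_supported() -> bool {
    cfg!(any(target_arch = "x86_64", target_arch = "aarch64"))
}

fn main() {
    println!("architecture:        {}", std::env::consts::ARCH);
    println!("operating system:    {}", std::env::consts::OS);
    println!("JIT-capable arch:    {}", jit_arch_supported());
    println!("AVX2 available:      {}", avx2_available());
}
```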
Build without default features for a minimal installation:
cargo build --no-default-features
Build with all optimizations:
cargo build --features "full"
The library automatically selects the best execution engine based on pattern characteristics:
Non-JIT mode (default):
JIT mode (with jit feature):
See docs/architecture.md for details on the engine selection logic.
[Chart: speedup relative to regex crate (higher is better)]

Highlights (speedup vs regex crate):
| Benchmark | regexr | regexr-jit | pcre2-jit |
|---|---|---|---|
| log_parsing | 0.80-0.84x | 3.91-4.09x | 3.57-3.71x |
| url_extraction | 0.81-0.83x | 1.95-1.99x | 2.10-2.13x |
| unicode_letters | 1.24x | 1.43-1.44x | 1.65-1.72x |
| html_tags | 0.82-0.87x | 1.33-1.43x | 0.80-0.85x |
| word_boundary | 1.19-1.24x | 1.15-1.19x | 0.72-0.74x |
| email_validation | 0.99-1.00x | 1.00-1.11x | 0.94-1.00x |
| alternation | 0.88-1.01x | 0.88-1.01x | 0.12-0.15x |
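To read the table, it helps to pin down what a speedup figure means. The sketch below assumes the conventional definition, baseline time divided by candidate time, so values above 1.0 mean the candidate is faster than the regex crate. The source does not state the exact formula or the raw timings, so both the `speedup` function and the numbers in `main` are illustrative assumptions.

```rust
/// Speedup of a candidate engine relative to a baseline, assuming the
/// conventional definition: baseline time / candidate time.
/// Values above 1.0 mean the candidate is faster.
fn speedup(baseline_ns: f64, candidate_ns: f64) -> f64 {
    baseline_ns / candidate_ns
}

fn main() {
    // Hypothetical timings, chosen only to illustrate the arithmetic.
    let baseline_ns = 400.0; // time taken by the baseline engine
    let fast_ns = 100.0; // a candidate that is four times quicker
    let slow_ns = 500.0; // a candidate that is somewhat slower

    println!("fast candidate: {:.2}x", speedup(baseline_ns, fast_ns));
    println!("slow candidate: {:.2}x", speedup(baseline_ns, slow_ns));
}
```

Under this definition, an entry such as 0.80x means the engine took 1.25 times as long as the baseline on that benchmark.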
- regexr-jit excels at log parsing (about 4x faster than regex)
- regexr (non-JIT) roughly matches regex performance on most patterns (0.80x to 1.24x in the table above)
- … fancy-regex and pcre2 (non-JIT) consistently

If you use regexr in your research, please cite:
@software{regexr2025,
  author = {Syah, Farhan},
  title = {regexr: A Pure-Rust Regex Engine with JIT Compilation for LLM Tokenization},
  year = {2025},
  url = {https://github.com/farhan-syah/regexr},
  note = {Experimental regex engine with lookaround support and JIT compilation}
}