# `reggy`

A friendly regular expression dialect for text analytics. Typical regex features are removed or adjusted to make natural language queries easier. Unicode-aware and able to search a stream with several patterns at once.

## Should I Use `reggy`?

If you are working on a text processing problem with streaming datasets or hand-tuned regexes for natural language, you may find the feature set compelling.

| Crate | Match Streams? | Case Insensitivity? | Pattern Flexibility? |
|-------|----------------|---------------------|----------------------|
| [`aho-corasick`](https://docs.rs/aho-corasick/) | ✅ | simple ASCII | string set |
| [`regex`](https://docs.rs/regex) | ❌ | [Unicode best-effort](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches) | full-featured regex |
| `reggy` | ✅ | [Unicode best-effort](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches) | regex subset |

## API Usage

Use the high-level [`Pattern`](https://doc-sieve.github.io/reggy/reggy/struct.Pattern.html) struct for simple search.

```rust
let mut p = Pattern::new("dogs?")?;
assert_eq!(
    p.findall_spans("cat dog dogs cats"),
    vec![(4, 7), (8, 12)]
);
```

Use the [`Ast`](https://doc-sieve.github.io/reggy/reggy/enum.Ast.html) struct to transpile to [normal](https://docs.rs/regex/) regex syntax.

```rust
let ast = Ast::parse(r"dog(gy)?|dawg|(!CAT|KITTY CAT)")?;
assert_eq!(
    ast.to_regex(),
    r"\b(?mi:dog(?:gy)?|dawg|(?-i:CAT|KITTY\s+CAT))\b"
);
```

### Stream a File

In this example, we will count the matches of a set of patterns within a file without loading it into memory.

Use the [`Search`](https://doc-sieve.github.io/reggy/reggy/struct.Search.html) struct to search a stream with several patterns at once.

Create a `BufReader` for the text.

```rust
use std::fs::File;
use std::io::{self, BufReader};

let f = File::open("tests/samples/republic_plato.txt")?;
let f = BufReader::new(f);
```

Compile the search object.

```rust
let patterns = [
    r"yes|(very )?true|certainly|quite so|I have no objection|I agree",
    r"\?",
];
let mut pattern_counts = [0; 2];
let mut search = Search::compile(&patterns)?;
```

Call `Search::iter` to create a [`StreamSearch`](https://doc-sieve.github.io/reggy/reggy/struct.StreamSearch.html). Any IO errors or malformed UTF-8 will be returned as a [`SearchStreamError`](https://doc-sieve.github.io/reggy/reggy/enum.SearchStreamError.html).

```rust
for result in search.iter(f) {
    match result {
        Ok(m) => {
            pattern_counts[m.id] += 1;
        }
        Err(e) => {
            println!("Stream Error {e:?}");
            break;
        }
    }
}

println!("Assent Count: {}", pattern_counts[0]);
println!("Question Count: {}", pattern_counts[1]);
// Assent Count: 1467
// Question Count: 1934
```

### Walk a Stream Manually

```rust
let mut search = Search::compile(&[
    r"$#?#?#.##",
    r"(John|Jane) Doe"
])?;
```

Call `Search::next` to begin searching. It will yield any matches deemed [definitely-complete](#definitely-complete-matches) immediately.

```rust
let jane_match = Match::new(1, (0, 8));
assert_eq!(
    search.next("Jane Doe paid John"),
    vec![jane_match]
);
```

Call `Search::next` again to continue with the same search state. Note that `"John Doe"` matched across the chunk boundary, and that spans are relative to the start of the stream.
```rust
let john_match = Match::new(1, (14, 22));
let money_match_1 = Match::new(0, (23, 29));
let money_match_2 = Match::new(0, (41, 48));
assert_eq!(
    search.next(" Doe $45.66 instead of $499.00"),
    vec![john_match, money_match_1, money_match_2]
);
```

Call `Search::finish` to collect any not-[definitely-complete matches](#definitely-complete-matches) once the stream is closed.

```rust
assert_eq!(search.finish(), vec![]);
```

See more in the [API docs](https://doc-sieve.github.io/reggy).

## Pattern Language

`reggy` is case-insensitive by default. Spaces match any amount of whitespace (i.e. `\s+`). All the reserved characters mentioned below (`\`, `(`, `)`, `{`, `}`, `,`, `?`, `|`, `#`, and `!`) may be escaped with a backslash for a literal match.

Patterns are surrounded by implicit [Unicode word boundaries](https://unicode.org/reports/tr29) (i.e. `\b`). Empty patterns or subpatterns are not permitted.

### Examples

*Make a character optional with* `?`

`dogs?` matches `dog` and `dogs`

*Create two or more alternatives with* `|`

`dog|cat` matches `dog` and `cat`

*Create a sub-pattern with* `(...)`

`the qualit(y|ies) required` matches `the quality required` and `the qualities required`

`the only( one)? around` matches `the only around` and `the only one around`

*Create a case-sensitive sub-pattern with* `(!...)`

`United States of America|(!USA)` matches `USA`, not `usa`

*Match digits with* `#`

`#.##` matches `3.14`

*Match exactly n times with* `{n}`*, or between n and m times with* `{n,m}`

`(very ){1,4}strange` matches `very very very strange`

## Definitely-Complete Matches

`reggy` follows "leftmost-longest", greedy matching semantics. A pattern may match after one step of a stream, yet may match a longer form depending on the next step. For example, `abb?` will match `s.next("ab")`, but a subsequent call to `s.next("b")` would create a longer match, `"abb"`, which should supersede the match `"ab"`. `Search` only yields matches once they are definitely complete and cannot be superseded by future `next` calls.

Each pattern has a [maximum byte length](https://doc-sieve.github.io/reggy/reggy/enum.Ast.html#method.max_bytes) `L`, counting contiguous whitespace as 1 byte.[^1] `reggy` yields a match after streaming at most `L` bytes past its start without superseding it. As a consequence, **results of a given `Search` are the same regardless of how a given haystack stream is chunked**. `Search::next` returns `Match`es as soon as it practically can while respecting this invariant.

## Implementation

The pattern language is parsed with [`lalrpop`](https://lalrpop.github.io/lalrpop) ([grammar](https://github.com/doc-sieve/reggy/blob/main/src/parser/grammar.lalrpop)). The search routines use a [`regex_automata::dense::DFA`](https://docs.rs/regex-automata/latest/regex_automata/dfa/dense/struct.DFA.html). Compared to other regex engines, the dense DFA is memory-intensive and slow to construct, but searches are fast. Unicode word boundaries are handled by the [`unicode_segmentation`](https://docs.rs/unicode-segmentation/latest) crate.

[^1]: This is why unbounded quantifiers are absent from `reggy`. When a pattern requires `*` or `+`, users should choose an upper limit (`{0,n}`, `{1,n}`) instead.
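Since unbounded quantifiers are absent, a pattern that would use `+` in ordinary regex picks an explicit upper bound instead. The sketch below uses the `Pattern` API shown earlier; the pattern, haystack, and spans are illustrative (implied by the documented word-boundary and leftmost-longest semantics), not taken from the crate's test suite.

```rust
use reggy::Pattern;

// Ordinary regex might write r"(ha)+"; in reggy, choose an explicit upper bound.
let mut p = Pattern::new("(ha){1,10}")?;

// Half-open byte spans, as in the earlier `findall_spans` example.
assert_eq!(
    p.findall_spans("ha haha hahahaha"),
    vec![(0, 2), (3, 7), (8, 16)]
);
```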
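The chunk-invariance guarantee from [Definitely-Complete Matches](#definitely-complete-matches) can be spot-checked by streaming the same text two different ways. A minimal sketch, assuming only the `Search` methods shown above; the pattern and chunk boundaries here are hypothetical.

```rust
use reggy::Search;

// Stream the whole haystack in one step.
let mut whole = Search::compile(&[r"abb?"])?;
let mut matches_whole = whole.next("xx abb yy ab");
matches_whole.extend(whole.finish());

// Stream the same haystack again, split at arbitrary points.
let mut chunked = Search::compile(&[r"abb?"])?;
let mut matches_chunked = chunked.next("xx ab");
matches_chunked.extend(chunked.next("b yy a"));
matches_chunked.extend(chunked.next("b"));
matches_chunked.extend(chunked.finish());

// The reported matches are identical regardless of chunking.
assert_eq!(matches_whole, matches_chunked);
```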