Meta | Meaning |
---|---|
`.` | Matches any character. |
`pq` | Matches the concatenation of the patterns `p` and `q`. Concatenation is an invisible infix operator; it has no meta character. |
`p*` | Matches the preceding pattern `p` zero or more times (Kleene star). |
`p+` | Matches the preceding pattern `p` one or more times (Kleene plus). |
`p?` | Matches the preceding pattern `p` zero times or one time. |
`p\|q` | An infix operator. Matches if at least one of the operand patterns `p` or `q` matches. This is called alternation. |
`p{m}` | Matches the preceding pattern exactly `m` times. |
`p{m,}` | Matches the preceding pattern `m` or more times. |
`p{m,n}` | Matches the preceding pattern from `m` up to `n` times. |
`()` | Grouping (overrides operator precedence). |
`{}` | Escape sequences. |
`[]` | Character classes. |
`(*...)` | A marked group; the match will be added to the list of groups. |

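The quantifiers in the table are related to each other: `p*`, `p+` and `p?` can each be written with the counted forms. The following is a minimal sketch that checks this, using Python's `re` module as a stand-in because it shares this quantifier notation. The helper `full` and the sample strings are illustrative assumptions and are not part of the library described here.

```python
import re

def full(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

samples = ["", "x", "xx", "xxx", "xxxx"]

# Each pair of patterns should accept exactly the same samples.
equivalent = [
    ("x*", "x{0,}"),   # zero or more
    ("x+", "x{1,}"),   # one or more
    ("x?", "x{0,1}"),  # zero or one
    ("x{2}", "xx"),    # exactly two
]
for a, b in equivalent:
    assert [full(a, s) for s in samples] == [full(b, s) for s in samples]

# A bounded repetition accepts only the counts inside its range.
two_or_three = [s for s in samples if full("x{2,3}", s)]
print(two_or_three)
```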
Escape sequence | Meaning |
---|---|
`{s}` | `"\s"` (space) |
`{t}` | `"\t"` (tabulator) |
`{n}` | `"\n"` (newline) |
`{r}` | `"\r"` (carriage return) |
`{.}` | `"."` (dot) |
`{*}` | `"*"` (asterisk) |
`{+}` | `"+"` (plus) |
`{(}` | `"("` (left parenthesis) |
`{L}` | `"{"` (left curly bracket) |
`{R}` | `"}"` (right curly bracket) |
`{a}` | `[A-Za-z]` (alphabetic ASCII characters) |
`{d}` | `[0-9]` (ASCII digits) |
`{x}` | `[0-9A-Fa-f]` (hexadecimal ASCII digits) |
`{l}` | Lowercase ASCII letters |
`{u}` | Uppercase ASCII letters |
`{ua}` | Alphabetic Unicode characters |
`{ul}` | Lowercase Unicode letters |
`{uu}` | Uppercase Unicode letters |
`{g}` | `[\u{21}-\u{7e}]` (graphical: visible ASCII) |
`{_}` | Whitespace |
`{B}` | Beginning of the string |
`{E}` | End of the string |
`{LB}` | Beginning of a line |
`{LE}` | End of a line |

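To make the expansions in the table concrete, here is a rough sketch in Python that maps some of the escape sequences to equivalent text for Python's `re` module. It is an assumption-laden illustration, not the library's implementation: the helper names `expand` and `matches` are hypothetical, and only the escapes with an explicit ASCII meaning are covered (the Unicode classes, whitespace and the anchors are left out).

```python
import re

# Expansions copied from the table above. Literal escapes stand for one
# character; class escapes stand for a set of ASCII characters.
LITERALS = {
    "s": " ", "t": "\t", "n": "\n", "r": "\r",
    ".": ".", "*": "*", "+": "+", "(": "(",
    "L": "{", "R": "}",
}
CLASSES = {
    "a": "A-Za-z",
    "d": "0-9",
    "x": "0-9A-Fa-f",
    "l": "a-z",
    "u": "A-Z",
    "g": "\\x21-\\x7e",
}

def expand(name):
    """Hypothetical helper: Python regex text for the escape {name}."""
    if name in LITERALS:
        # Escape the character so that it is matched literally.
        return re.escape(LITERALS[name])
    return "[" + CLASSES[name] + "]"

def matches(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# {x}{2} in the notation above: exactly two hexadecimal digits.
hex_byte = expand("x") + "{2}"
print(matches(hex_byte, "7f"), matches(hex_byte, "7g"))

# {L}{d}{R}: a digit enclosed in literal curly brackets.
braced_digit = expand("L") + expand("d") + expand("R")
print(matches(braced_digit, "{5}"))
```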
    use regex: re

    # Some string matches itself
    > re("café").match("café")
    true

    # Matches if either of two alternatives occurs:
    > r = re("moon|soon")
    > [r.match("moon"), r.match("soon")]
    [true, true]

    # Character concatenation binds more tightly than '|',
    # but we can override this by grouping.
    > r = re("(m|s)oon")
    > [r.match("moon"), r.match("soon")]
    [true, true]

    # A digit
    re("0|1|2|3|4|5|6|7|8|9")

    # A digit, expressed as a character class
    re("[0-9]")

    # A digit, shortest notation
    re("{d}")

    # A binary literal
    re("(0|1)+")

    # A binary literal, digits expressed as a character class
    re("[01]+")

    # A date (year-month-day)
    re("{d}{d}{d}{d}-{d}{d}-{d}{d}")

    # A date, leading zeros not needed
    re("{d}{d}?{d}?{d}?-{d}{d}?-{d}{d}?")

    # A date, using advanced quantifiers
    re("{d}{4}-{d}{2}-{d}{2}")
    re("{d}{1,4}-{d}{1,2}-{d}{1,2}")

    # Muhkuh (moo(ing)? cow)
    > re("Mu+h+").match("Muuuuuhhh")
    true

    # Kleene star
    > r = re("(x|y)*")
    > ["", "x", "y", "xx", "xy", "yx", "yy",
       "xxx", "xxy", "xxyxxxyyx"].all(|s| r.match(s))
    true

    # Integer literals
    > r = re("[+-]?{d}+")
    > r.match("-12")
    true

    # Simple floating point literals
    re("[+-]?{d}+{.}{d}+")

    # Full floating point literals
    re("[+-]?({d}+({.}{d}*)?|{.}{d}+)([Ee][+-]?{d}+)?")

    # Whitespace has no meaning inside of a regular expression
    re("""
      [+-]?
      ( {d}+ ({.} {d}*)?
      | {.} {d}+
      )
      ([Ee] [+-]? {d}+)?
    """)

    # Whitespace has to be stated explicitly
    > r = re("a{s}*b")
    > r.match("a\s\s\s\s\sb")
    true
    > r.match("a\s\tb")
    false

Exercise for the reader: state a pattern for full floating point literals that excludes integer literals.
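
One possible answer, offered as a sketch rather than as the intended solution: require either a decimal point or an exponent. In the notation of this page that might read `[+-]?(({d}+{.}{d}*|{.}{d}+)([Ee][+-]?{d}+)?|{d}+[Ee][+-]?{d}+)`, which has not been tested against the library. The block below checks the same idea with Python's `re` module, under the assumption that `{d}` corresponds to `[0-9]` and `{.}` to `\.`; the names `FLOAT_ONLY` and `is_float_literal` are hypothetical.

```python
import re

# Python translation of the pattern above, written in verbose mode so that
# whitespace inside the pattern is ignored.
FLOAT_ONLY = re.compile(
    r"""
    [+-]?
    (?:
        (?: [0-9]+ \. [0-9]*      # digits, a dot, optional fraction digits
          | \. [0-9]+             # or a leading dot with fraction digits
        )
        (?: [Ee] [+-]? [0-9]+ )?  # optional exponent
      | [0-9]+ [Ee] [+-]? [0-9]+  # no dot: the exponent is mandatory
    )
    """,
    re.VERBOSE,
)

def is_float_literal(s):
    """Return True if s is a floating point literal but not an integer."""
    return FLOAT_ONLY.fullmatch(s) is not None

accepted = ["1.5", "1.", ".5", "1e10", "-2.5E-3"]
rejected = ["12", "-12", "+7", ".", ""]
print(all(is_float_literal(s) for s in accepted))
print(any(is_float_literal(s) for s in rejected))
```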
    # Matches a single character from a list.
    re("(a|b|c|d|1|2)")

    # Such a list may be written more briefly as a character class.
    re("[abcd12]")

    # Ranges of characters can be stated
    # using range notation.
    re("[a-d12]")

    # Any range from one Unicode code point up to another
    # Unicode code point can be such a range.
    # For example, the Greek alphabet is:
    re("[\u{0391}-\u{03a9}\u{03b1}-\u{03c9}]")

    # That is:
    re("[Α-Ωα-ω]")

    # Escape sequences can occur inside of character classes.
    re("[{d}{a}]")

    # This is the same as:
    re("[0-9A-Za-z]")

Often one wants to find all non-overlapping matches of a pattern in a string. If `r` is some regex, then `r.list(s)` returns the list of all non-overlapping occurrences of `r` in `s`.

    use regex: re

    word = re("{a}+")
    text = "The quick brown fox jumps over the lazy dog."
    print(word.list(text))
    # Output:
    # ["The", "quick", "brown", "fox", "jumps",
    # "over", "the", "lazy", "dog"]

Groups can be extracted from a string according to a regular expression. A group is formed by a pair of parentheses that has an asterisk after the opening parenthesis. The method `groups` returns `null` if the regex does not match, otherwise the list of groups.

    use regex: re

    r = re("(*{d}{d}{d}{d})-(*{d}{d})-(*{d}{d})")

    while true
      s = input("Date: ")
      t = r.groups(s)
      if t is null
        print("A well formed date please!")
      else
        print(t)
      end
    end

    # Date: 2016-10-14
    # ["2016", "10", "14"]

Rather than being returned as a list, each non-overlapping match `x` may be replaced by `f(x)`. To achieve this, there is the method `r.replace(s,f)`.

    use regex: re

    text = "The quick brown fox jumps over the lazy dog."
    print(re("{a}+").replace(text,|x| "["+x+"]"))
    # [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog].

    use regex: re

    function tokenizer(d)
      r = re(d["r"])
      f = d["f"] if "f" in d else null
      return fn|s|
        a = r.list(s)
        return a if f is null else a.map(f)
      end
    end

    words = tokenizer({
      r = "{a}+"
    })

    integers = tokenizer({
      r = "{d}+",
      f = int
    })

    numbers = tokenizer({
      r = "({d}|{.})+",
      f = |x| float(x) if '.' in x else int(x)
    })

    for line in input
      a = numbers(line)
      print(a)
    end

    use regex: re

    function bind_regex(rs)
      r = re(rs)
      return |s| r.match(s)
    end

    isalpha_german = bind_regex("[A-Za-zÄÖÜäöüß]*")

    isalpha_latin = bind_regex("""[
      A-Z a-z
      \u{00c0}-\u{00d6}
      \u{00d8}-\u{00f6}
      \u{00f8}-\u{024f}
    ]*""")

As one can see, the Unicode letter range is fragmented by two mathematical operators, the multiplication sign (`\u{d7}`) and the division sign (`\u{f7}`). Matching upper and lower case is even more complicated. Furthermore, in general a letter may be followed by combining characters.
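
The remark about combining characters can be illustrated with a small sketch. It uses Python's `re` and `unicodedata` modules as a stand-in, on the assumption that a character class of precomposed letters behaves the same way in both engines; `LATIN` and the Python function `isalpha_latin` below are illustrative re-creations and not the definitions above. The letter é written as `e` followed by the combining acute accent (U+0301) is rejected, while the precomposed form (U+00E9) is accepted. Normalizing to NFC first is one common workaround, though it only helps where a precomposed code point exists.

```python
import re
import unicodedata

# The same code point ranges as in isalpha_latin, in Python notation.
LATIN = re.compile(r"[A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u024f]*")

def isalpha_latin(s):
    """Return True if every code point of s lies in the Latin ranges."""
    return LATIN.fullmatch(s) is not None

composed = "caf\u00e9"     # é as one precomposed code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

print(isalpha_latin(composed))
print(isalpha_latin(decomposed))

# NFC normalization composes e + U+0301 into U+00E9 before matching.
normalized = unicodedata.normalize("NFC", decomposed)
print(isalpha_latin(normalized))

# The two excluded operators are rejected, as intended.
print(isalpha_latin("\u00d7"), isalpha_latin("\u00f7"))
```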