Regular expressions

Overview
Basics
Character classes
Finding all
Groups
Replacement
Examples

Overview

Table of meta characters

Meta	Meaning
`.`	Matches any character.
`pq`	Matches the concatenation of the patterns `p` and `q`. Concatenation is an invisible infix operator; it has no meta character.
`p*`	Matches the preceding pattern `p` zero or more times. (Kleene star)
`p+`	Matches the preceding pattern `p` one or more times. (Kleene plus)
`p?`	Matches the preceding pattern `p` zero times or one time.
`p|q`	An infix operator. Matches if at least one of the operand patterns `p` or `q` matches. This is called alternation.
`p{m}`	Matches the preceding pattern exactly `m` times.
`p{m,}`	Matches the preceding pattern `m` or more times.
`p{m,n}`	Matches the preceding pattern from `m` up to `n` times.
`()`	Grouping (overrides operator precedence).
`{}`	Escape sequences.
`[]`	Character classes.
`(*...)`	A marked group; its match is added to the list of groups.

Table of escape sequences

Escape sequence	Meaning
`{s}`	`"\s"` (space)
`{t}`	`"\t"` (tabulator)
`{n}`	`"\n"` (newline)
`{r}`	`"\r"` (carriage return)
`{.}`	`"."` (dot)
`{*}`	`"*"` (asterisk)
`{+}`	`"+"` (plus)
`{(}`	`"("` (left parenthesis)
`{L}`	`"{"` (left curly bracket)
`{R}`	`"}"` (right curly bracket)
`{a}`	`[A-Za-z]` (alphabetic ASCII characters)
`{d}`	`[0-9]` (ASCII digits)
`{x}`	`[0-9A-Fa-f]` (hexadecimal ASCII digits)
`{l}`	Lowercase ASCII letters
`{u}`	Uppercase ASCII letters
`{ua}`	Alphabetic Unicode characters
`{ul}`	Lowercase Unicode letters
`{uu}`	Uppercase Unicode letters
`{g}`	`[\u{21}-\u{7e}]` (graphical: visible ASCII)
`{_}`	Whitespace
`{B}`	Beginning of the string
`{E}`	End of the string
`{LB}`	Beginning of a line
`{LE}`	End of a line

Basics

use regex: re

# Some string matches itself
> re("café").match("café")
true

# Matches if either of two alternatives occurs:
> r = re("moon|soon")
> [r.match("moon"), r.match("soon")]
[true, true]

# Concatenation binds more tightly than '|',
# but grouping can override this.
> r = re("(m|s)oon")
> [r.match("moon"), r.match("soon")]
[true, true]

# A digit
re("0|1|2|3|4|5|6|7|8|9")

# A digit, expressed as a character class
re("[0-9]")

# A digit, shortest notation
re("{d}")

# A binary literal
re("(0|1)+")

# A binary literal, digits expressed as a character class
re("[01]+")

# A date (year-month-day)
re("{d}{d}{d}{d}-{d}{d}-{d}{d}")

# A date, leading zeros not needed
re("{d}{d}?{d}?{d}?-{d}{d}?-{d}{d}?")

# A date, using advanced quantifiers
re("{d}{4}-{d}{2}-{d}{2}")
re("{d}{1,4}-{d}{1,2}-{d}{1,2}")

# Muhkuh (moo(ing)? cow)
> re("Mu+h+").match("Muuuuuhhh")
true

# Kleene star
> r = re("(x|y)*")
> ["", "x", "y", "xx", "xy", "yx", "yy",
  "xxx", "xxy", "xxyxxxyyx"].all(|s| r.match(s))
true

# Integer literals
> r = re("[+-]?{d}+")
> r.match("-12")
true

# Simple floating point literals
re("[+-]?{d}+{.}{d}+")

# Full floating point literals
re("[+-]?({d}+({.}{d}*)?|{.}{d}+)([Ee][+-]?{d}+)?")

# Whitespace has no meaning inside of a regular expression
re("""
  [+-]?
  (   {d}+ ({.} {d}*)?
    | {.} {d}+
  )
  ([Ee] [+-]? {d}+)?
""")

# Whitespace has to be stated explicitly
> r = re("a{s}*b")
> r.match("a\s\s\s\s\sb")
true

> r.match("a\s\tb")
false

Exercise for the reader: How can one state a pattern for full floating point literals that excludes integer literals?
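One possible approach to the exercise, sketched in Python's standard re module rather than in the regex module described on this page: split the pattern into alternatives so that every alternative requires either a dot or an exponent. The translation of `{d}` to `\d` and `{.}` to `\.`, the use of fullmatch to mirror match, and the helper name is_float_literal are assumptions made for this illustration.

```python
import re

# Assumption: Python's re syntax stands in for the notation used above.
# Every alternative contains a dot or an exponent, so plain integers fail.
FLOAT_ONLY = re.compile(r"""
    [+-]?
    (?:
        \d+ \. \d* (?:[Ee][+-]?\d+)?   # digits and a dot, optional exponent
      | \. \d+ (?:[Ee][+-]?\d+)?       # leading dot, optional exponent
      | \d+ [Ee][+-]?\d+               # no dot, so the exponent is mandatory
    )
""", re.VERBOSE | re.ASCII)

def is_float_literal(s):
    """Hypothetical helper: True if the whole string is a float literal."""
    return FLOAT_ONLY.fullmatch(s) is not None

for s in ["1.5", "1.", ".5", "1e5", "12", "-12"]:
    print(s, is_float_literal(s))
```

The same three alternatives should carry over to the notation of this page by writing `{d}` and `{.}` in place of `\d` and `\.`.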

Character classes

# Matches a single character from a list.
re("(a|b|c|d|1|2)")

# Such a list may be written more briefly as a character class.
re("[abcd12]")

# And ranges of characters can be stated,
# using range notation.
re("[a-d12]")

# Any range from one Unicode code point up to another
# Unicode code point can be such a range.
# For example, the Greek alphabet is:
re("[\u{0391}-\u{03a9}\u{03b1}-\u{03c9}]")
# That is:
re("[Α-Ωα-ω]")

# Escape sequences can occur inside of character classes.
re("[{d}{a}]")
# This is the same as:
re("[0-9A-Za-z]")

Finding all

Often one wants to find all non-overlapping matches of a pattern in a string. If r is a regex, then r.list(s) returns the list of all non-overlapping occurrences of r in s.

use regex: re

word = re("{a}+")

text = "The quick brown fox jumps over the lazy dog."

print(word.list(text))
# Output:
# ["The", "quick", "brown", "fox", "jumps",
# "over", "the", "lazy", "dog"]
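To make the meaning of "non-overlapping" concrete, here is a rough comparison sketch in Python's standard re module, not in the regex module used on this page. It assumes that `{a}+` corresponds to `[A-Za-z]+` and that list behaves like Python's findall; both are assumptions made for illustration.

```python
import re

# Assumption: [A-Za-z]+ plays the role of {a}+ from the example above.
word = re.compile(r"[A-Za-z]+")

text = "The quick brown fox jumps over the lazy dog."

# findall scans left to right; each search resumes where the last match ended.
words = word.findall(text)
print(words)

# "aaaaa" contains four overlapping "aa" substrings,
# but only two disjoint ones are reported.
pairs = re.findall(r"aa", "aaaaa")
print(pairs)
```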

Groups

Groups can be extracted from a string according to a regular expression. A group is formed by a pair of parentheses with an asterisk directly after the opening parenthesis. The method groups returns null if the regex does not match; otherwise it returns the list of groups.

use regex: re
r = re("(*{d}{d}{d}{d})-(*{d}{d})-(*{d}{d})")

while true
   s = input("Date: ")
   t = r.groups(s)
   if t is null
      print("A well formed date please!")
   else
      print(t)
   end
end

# Date: 2016-10-14
# ["2016", "10", "14"]
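For comparison, the "null or list of groups" contract can be sketched in Python's standard re module. This is not the API of the regex module described here: plain parentheses stand in for the marked groups `(*...)`, fullmatch is assumed to mirror the whole-string matching seen in the Basics examples, and date_groups is a hypothetical helper name.

```python
import re

# Assumption: Python's capturing parentheses correspond to (*...).
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})", re.ASCII)

def date_groups(s):
    """Return the list of groups, or None if s is not a well-formed date."""
    m = date.fullmatch(s)
    return None if m is None else list(m.groups())

print(date_groups("2016-10-14"))
print(date_groups("14.10.2016"))
```

Returning None instead of an empty list keeps "no match" distinguishable from "matched, but the pattern has no groups".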

Replacement

Instead of returning the list of non-overlapping matches, each match x can be replaced by f(x). The method r.replace(s,f) does this.

use regex: re

text = "The quick brown fox jumps over the lazy dog."

print(re("{a}+").replace(text,|x| "["+x+"]"))

# [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog].
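The idea of replacing each match x by f(x) can also be sketched in Python's standard re module, shown here only as a comparison and not as the API of this page's regex module. One difference worth noting: Python passes a match object to the callback rather than the matched string. The function name bracket is hypothetical.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

def bracket(m):
    # m is a match object; m.group(0) is the matched text x.
    return "[" + m.group(0) + "]"

# Text between matches (spaces, the final dot) is copied unchanged.
result = re.sub(r"[A-Za-z]+", bracket, text)
print(result)
```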

Examples

A very simple tokenizer generator

use regex: re

function tokenizer(d)
   r = re(d["r"])
   f = d["f"] if "f" in d else null
   return fn|s|
      a = r.list(s)
      return a if f is null else a.map(f)
   end
end

words = tokenizer({
   r = "{a}+"
})

integers = tokenizer({
   r = "{d}+",
   f = int
})

numbers = tokenizer({
   r = "({d}|{.})+",
   f = |x| float(x) if '.' in x else int(x)
})

for line in input
   a = numbers(line)
   print(a)
end

Alternative versions of isalpha

use regex: re

function bind_regex(rs)
   r = re(rs)
   return |s| r.match(s)
end

isalpha_german = bind_regex("[A-Za-zÄÖÜäöüß]*")
isalpha_latin = bind_regex("""[
  A-Z a-z
  \u{00c0}-\u{00d6}
  \u{00d8}-\u{00f6}
  \u{00f8}-\u{024f}
]*""")

As one can see, the Unicode letter range is fragmented by two mathematical operators: the multiplication sign (\u{d7}) and the division sign (\u{f7}). Matching upper and lower case is even more complicated. Furthermore, a letter may in general be followed by combining characters.
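The combining-character problem can be demonstrated with a short sketch in Python's standard re and unicodedata modules, which are not part of the regex module described here. It rebuilds the same Latin ranges and shows that the decomposed form of "é" is rejected unless the input is first normalized to NFC. The function names and the choice of NFC normalization as a workaround are assumptions made for this illustration.

```python
import re
import unicodedata

# Same ranges as isalpha_latin above, written with Python escapes.
latin = re.compile("[A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u024f]*")

def isalpha_latin(s):
    return latin.fullmatch(s) is not None

def isalpha_latin_nfc(s):
    # Compose base letter + combining mark into one code point where possible.
    return isalpha_latin(unicodedata.normalize("NFC", s))

composed = "caf\u00e9"      # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by combining acute accent U+0301

print(isalpha_latin(composed))        # True
print(isalpha_latin(decomposed))      # False: U+0301 is outside the ranges
print(isalpha_latin_nfc(decomposed))  # True after normalization
print(isalpha_latin("\u00d7"))        # False: the multiplication sign is excluded
```

Normalization only helps where a precomposed code point exists; other letter and mark combinations would still need combining characters to be allowed explicitly.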