Regular expressions

Table of contents

  1. Overview
  2. Basics
  3. Character classes
  4. Finding all
  5. Groups
  6. Replacement
  7. Examples

Overview

Table of meta characters

Meta     Meaning
.        Matches any character.
pq       Matches the concatenation of the patterns p and q. This is an invisible infix operator; there is no meta character.
p*       Matches the preceding pattern p zero or more times. (Kleene star)
p+       Matches the preceding pattern p one or more times. (Kleene plus)
p?       Matches the preceding pattern p zero times or one time.
p|q      An infix operator. Matches if at least one operand pattern p or q matches. This is called alternation.
p{m}     Matches the preceding pattern exactly m times.
p{m,}    Matches the preceding pattern m or more times.
p{m,n}   Matches the preceding pattern from m up to n times.
()       Grouping (overrides operator precedence).
{}       Escape sequences.
[]       Character classes.
(*...)   A marked group; its match is added to the list of groups.

Table of escape sequences

Escape sequence   Meaning
{s}               "\s" (space)
{t}               "\t" (tabulator)
{n}               "\n" (newline)
{r}               "\r" (carriage return)
{.}               "." (dot)
{*}               "*" (asterisk)
{+}               "+" (plus)
{(}               "(" (left parenthesis)
{L}               "{" (left curly bracket)
{R}               "}" (right curly bracket)
{a}               [A-Za-z] (alphabetic ASCII characters)
{d}               [0-9] (ASCII digits)
{x}               [0-9A-Fa-f] (hexadecimal ASCII digits)
{l}               Lowercase ASCII letters
{u}               Uppercase ASCII letters
{ua}              Alphabetic Unicode characters
{ul}              Lowercase Unicode letters
{uu}              Uppercase Unicode letters
{g}               [\u{21}-\u{7e}] (graphical: visible ASCII)
{_}               Whitespace
{B}               Beginning of the string
{E}               End of the string
{LB}              Beginning of a line
{LE}              End of a line
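
To make the two tables above concrete, here is a rough sketch in Python (not Moss) that translates a subset of this notation into the syntax of Python's re module. The function name translate and both lookup tables are hypothetical, chosen only for this illustration. The sketch covers the ASCII escapes, the quantifiers, marked groups and the rule that unescaped whitespace is ignored. The Unicode and anchor escapes are left out, and finer differences between the two engines (for instance how the dot treats newlines) are not handled.

```python
import re

# Assumed translations for a subset of the escape table (ASCII entries only).
CLASS_ESCAPES = {"a": "A-Za-z", "d": "0-9", "x": "0-9A-Fa-f",
                 "l": "a-z", "u": "A-Z"}
LITERAL_ESCAPES = {"s": " ", "t": "\t", "n": "\n", "r": "\r", ".": ".",
                   "*": "*", "+": "+", "(": "(", "L": "{", "R": "}"}


def translate(pattern: str) -> str:
    """Translate a subset of the curly-brace notation to Python re syntax."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c.isspace():
            # Unescaped whitespace carries no meaning in this notation.
            i += 1
        elif c == "{":
            j = pattern.index("}", i)
            body = pattern[i + 1:j]
            if body in CLASS_ESCAPES:
                ranges = CLASS_ESCAPES[body]
                out.append(ranges if in_class else "[" + ranges + "]")
            elif body in LITERAL_ESCAPES:
                out.append(re.escape(LITERAL_ESCAPES[body]))
            elif re.fullmatch(r"[0-9]+(,[0-9]*)?", body):
                out.append("{" + body + "}")  # quantifier: {m}, {m,}, {m,n}
            else:
                raise ValueError("escape not covered by this sketch: " + body)
            i = j + 1
        elif c == "(" and not in_class:
            if pattern.startswith("(*", i):
                out.append("(")    # marked group -> capturing group
                i += 2
            else:
                out.append("(?:")  # plain group -> non-capturing group
                i += 1
        else:
            if c == "[":
                in_class = True
            elif c == "]":
                in_class = False
            out.append(c)
            i += 1
    return "".join(out)


print(translate("(*{d}{4})-(*{d}{2})-(*{d}{2})"))
# ([0-9]{4})-([0-9]{2})-([0-9]{2})
```

The same curly brackets serve as escape sequences and as quantifiers, so the sketch decides by content: a body made of digits and an optional comma is treated as a quantifier, anything else must be a known escape name.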

Basics

use regex: re

# A plain string matches itself
> re("café").match("café")
true

# Matches if either of the two alternatives occurs:
> r = re("moon|soon")
> [r.match("moon"), r.match("soon")]
[true, true]

# Concatenation binds more tightly than '|',
# but grouping overrides this.
> r = re("(m|s)oon")
> [r.match("moon"), r.match("soon")]
[true, true]

# A digit
re("0|1|2|3|4|5|6|7|8|9")

# A digit, expressed as a character class
re("[0-9]")

# A digit, shortest notation
re("{d}")

# A binary literal
re("(0|1)+")

# A binary literal, digits expressed as a character class
re("[01]+")

# A date (year-month-day)
re("{d}{d}{d}{d}-{d}{d}-{d}{d}")

# A date, leading zeros not needed
re("{d}{d}?{d}?{d}?-{d}{d}?-{d}{d}?")

# A date, using advanced quantifiers
re("{d}{4}-{d}{2}-{d}{2}")
re("{d}{1,4}-{d}{1,2}-{d}{1,2}")

# Muhkuh (moo(ing)? cow)
> re("Mu+h+").match("Muuuuuhhh")
true

# Kleene star
> r = re("(x|y)*")
> ["", "x", "y", "xx", "xy", "yx", "yy",
  "xxx", "xxy", "xxyxxxyyx"].all(|s| r.match(s))
true

# Integer literals
> r = re("[+-]?{d}+")
> r.match("-12")
true

# Simple floating point literals
re("[+-]?{d}+{.}{d}+")

# Full floating point literals
re("[+-]?({d}+({.}{d}*)?|{.}{d}+)([Ee][+-]?{d}+)?")

# Unescaped whitespace inside a regular expression is ignored
re("""
  [+-]?
  (   {d}+ ({.} {d}*)?
    | {.} {d}+
  )
  ([Ee] [+-]? {d}+)?
""")

# To match whitespace, state it explicitly with an escape sequence
> r = re("a{s}*b")
> r.match("a\s\s\s\s\sb")
true

> r.match("a\s\tb")
false
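
The precedence of concatenation over alternation only shows up when the parentheses are left out. As a cross-check, here is a small sketch in Python's re module (not the Moss regex module). It assumes that match in the listing above tests the whole string, which corresponds to fullmatch in Python.

```python
import re

# Without parentheses the two alternatives are "m" and "soon".
ungrouped = re.compile(r"m|soon")
# With parentheses the alternation covers only the first letter.
grouped = re.compile(r"(?:m|s)oon")

words = ["m", "moon", "soon"]
print([bool(ungrouped.fullmatch(w)) for w in words])  # [True, False, True]
print([bool(grouped.fullmatch(w)) for w in words])    # [False, True, True]
```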

Exercise for the reader: state a pattern for full floating point literals that excludes integer literals.
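
One possible answer, sketched in Python's re syntax rather than in the notation of this document: a literal counts as floating point only if it contains a dot or an exponent, so the pattern lists these cases separately. The name FLOAT_NOT_INT is made up for this illustration, and other solutions are equally valid.

```python
import re

# re.VERBOSE ignores unescaped whitespace, much like the notation above.
FLOAT_NOT_INT = re.compile(r"""
    [+-]?
    (?: [0-9]+ \. [0-9]* (?:[Ee][+-]?[0-9]+)?   # digits, dot, optional fraction
      | \. [0-9]+        (?:[Ee][+-]?[0-9]+)?   # leading dot, then fraction
      | [0-9]+           [Ee][+-]?[0-9]+        # digits with mandatory exponent
    )
""", re.VERBOSE)

for literal in ["1.5", "1.", ".5", "1e10", "-2.5E-3", "12", "-12", "."]:
    print(literal, bool(FLOAT_NOT_INT.fullmatch(literal)))
```

The first five literals are accepted. The integers "12" and "-12" and the lone dot are rejected.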

Character classes

# Matches a single character from a list.
re("(a|b|c|d|1|2)")

# Such a list may be written more briefly as a character class.
re("[abcd12]")

# And ranges of characters can be stated,
# using range notation.
re("[a-d12]")

# Any range from one Unicode code point up to another
# Unicode code point can be used as a range.
# For example, the Greek alphabet is:
re("[\u{0391}-\u{03a9}\u{03b1}-\u{03c9}]")
# That is:
re("[Α-Ωα-ω]")

# Escape sequences can occur inside character classes.
re("[{d}{a}]")
# This is the same as:
re("[0-9A-Za-z]")
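
A code point range covers exactly the code points between its two ends and nothing else. The following sketch uses Python's re module (not Moss) to show what that means for the Greek range above: letters carrying an accent mark have their own code points outside α-ω and are therefore not matched.

```python
import re

# The same two ranges as in the listing above: Α-Ω and α-ω.
greek = re.compile("[\u0391-\u03a9\u03b1-\u03c9]+")

plain = "\u03bb\u03bf\u03b3\u03bf\u03c2"      # logos written without accent
accented = "\u03bb\u03cc\u03b3\u03bf\u03c2"   # logos with U+03CC (omicron with tonos)

print(bool(greek.fullmatch(plain)))     # True
print(bool(greek.fullmatch(accented)))  # False: U+03CC lies beyond U+03C9
```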

Finding all

Often one wants to find all non-overlapping matches of a pattern in a string. If r is some regex, then r.list(s) returns the list of all non-overlapping occurrences of r in s.

use regex: re

word = re("{a}+")

text = "The quick brown fox jumps over the lazy dog."

print(word.list(text))
# Output:
# ["The", "quick", "brown", "fox", "jumps",
# "over", "the", "lazy", "dog"]
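
For comparison, Python's re module offers the same operation as re.findall. The sketch below is Python, not Moss. Its second call illustrates what non-overlapping means: a match starts only after the previous one has ended.

```python
import re

text = "The quick brown fox jumps over the lazy dog."

words = re.findall(r"[A-Za-z]+", text)
print(words)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Five letters contain only two non-overlapping occurrences of "aa".
pairs = re.findall(r"aa", "aaaaa")
print(pairs)  # ['aa', 'aa']
```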

Groups

Groups can be extracted from a string according to a regular expression. A group is marked by an asterisk directly after the opening parenthesis: (*...). The method r.groups(s) returns null if the regex does not match s, and otherwise the list of groups.

use regex: re
r = re("(*{d}{d}{d}{d})-(*{d}{d})-(*{d}{d})")

while true
   s = input("Date: ")
   t = r.groups(s)
   if t is null
      print("A well formed date please!")
   else
      print(t)
   end
end

# Date: 2016-10-14
# ["2016", "10", "14"]
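
The same extraction can be sketched in Python's re module, where every plain pair of parentheses is a capturing group. The helper name date_groups is hypothetical. It assumes, as the listing above suggests, that groups requires the whole string to match, and it returns None where the Moss method returns null.

```python
import re

DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def date_groups(s):
    """Return the list of captured groups, or None if s is not a date."""
    m = DATE.fullmatch(s)
    return None if m is None else list(m.groups())


print(date_groups("2016-10-14"))  # ['2016', '10', '14']
print(date_groups("14.10.2016"))  # None
```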

Replacement

Instead of being returned as a list, each non-overlapping match x can be replaced by f(x). The method r.replace(s,f) does this.

use regex: re

text = "The quick brown fox jumps over the lazy dog."

print(re("{a}+").replace(text,|x| "["+x+"]"))

# [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog].
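
In Python's re module the corresponding function is re.sub, which also accepts a function as the replacement. One difference in this Python sketch: the callback receives a match object rather than the matched string, so the text is obtained with m.group(0).

```python
import re

text = "The quick brown fox jumps over the lazy dog."

bracketed = re.sub(r"[A-Za-z]+", lambda m: "[" + m.group(0) + "]", text)
print(bracketed)
# [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog].
```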

Examples

A very simple tokenizer generator

use regex: re

function tokenizer(d)
   r = re(d["r"])
   f = d["f"] if "f" in d else null
   return fn|s|
      a = r.list(s)
      return a if f is null else a.map(f)
   end
end

words = tokenizer({
   r = "{a}+"
})

integers = tokenizer({
   r = "{d}+",
   f = int
})

numbers = tokenizer({
   r = "({d}|{.})+",
   f = |x| float(x) if '.' in x else int(x)
})

for line in input
   a = numbers(line)
   print(a)
end
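
The same idea can be sketched in Python, assuming re.findall as the counterpart of r.list. The parameter name convert is chosen for this illustration only.

```python
import re


def tokenizer(pattern, convert=None):
    """Build a function that lists all matches, optionally converted."""
    r = re.compile(pattern)

    def tokenize(s):
        tokens = r.findall(s)
        return tokens if convert is None else [convert(t) for t in tokens]

    return tokenize


words = tokenizer(r"[A-Za-z]+")
integers = tokenizer(r"[0-9]+", int)
numbers = tokenizer(r"[0-9.]+", lambda x: float(x) if "." in x else int(x))

print(words("3 apples cost 4.50"))    # ['apples', 'cost']
print(numbers("3 apples cost 4.50"))  # [3, 4.5]
```

The number pattern is deliberately simple and also accepts tokens such as "1.2.3" or a lone ".", which the conversion in this Python version rejects with a ValueError. A stricter pattern avoids this.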

Alternative versions of isalpha

use regex: re

function bind_regex(rs)
   r = re(rs)
   return |s| r.match(s)
end

isalpha_german = bind_regex("[A-Za-zÄÖÜäöüß]*")
isalpha_latin = bind_regex("""[
  A-Z a-z
  \u{00c0}-\u{00d6}
  \u{00d8}-\u{00f6}
  \u{00f8}-\u{024f}
]*""")

As one can see, the range of Latin letters in Unicode is interrupted by two mathematical operators: the multiplication sign (\u{d7}) and the division sign (\u{f7}). Matching upper and lower case separately is even more complicated. Furthermore, a letter may in general be followed by combining characters, which these character classes do not cover.
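
The remark about combining characters can be illustrated with a sketch in Python (not Moss). The string "café" can be written with a precomposed é (U+00E9) or as e followed by a combining acute accent (U+0301). The character class accepts only the first form. Normalizing to NFC first, here with Python's unicodedata module, is one common workaround, though it only helps where a precomposed character exists.

```python
import re
import unicodedata

# The same ranges as in isalpha_latin above, in Python syntax.
LATIN = re.compile("[A-Za-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u024f]*")


def isalpha_latin(s):
    """Check s against the Latin class after composing characters (NFC)."""
    return LATIN.fullmatch(unicodedata.normalize("NFC", s)) is not None


decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

print(LATIN.fullmatch(decomposed) is not None)  # False
print(isalpha_latin(decomposed))                # True
print(isalpha_latin("\u00d7"))                  # False: multiplication sign
```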