Template reference
==================
File syntax
-----------
The template file should consist of one CREATE TABLE statement, with expressions telling how a value
should be generated inside `{{ … }}` or `/*{{ … }}*/` blocks, e.g.
```sql
CREATE TABLE "database"."schema"."table" (
"id" INTEGER,
/*{{ rownum }}*/
"name" CHAR(40),
/*{{ rand.regex('[a-zA-Z ]{40}') }}*/
UNIQUE KEY "some_index"("id")
);
```
Each `{{ … }}` block appearing alongside the column definitions represent one value in every
generated row. The example above may produce output like
```sql
INSERT INTO "table" VALUES
(1, 'cHvkcFSMq YbERjjeBUzLaOG TOYvDrHhfymyQeP'),
(2, 'kBAQlctdPLAlYZPyvYoBRIhYiEtOONCZQVpxpbbw'),
(3, 'ILqyqKYi jTqXaUjsgdFqYxFasnUMzXaFRvqJDdx'),
…
```
See [Advanced template features](./TemplateAdvanced.md) for more syntactical features.
Expression syntax
-----------------
`dbgen` supports an SQL-like expression syntax. These expressions will be re-evaluated for every new
row generated.
### Literals
`dbgen` supports numbers and string literals.
* **Integers**
Decimal and hexadecimal numbers are supported. The value must be between 0 and
264 − 1.
Examples: `0`, `3`, `18446744073709551615`, `0X1234abcd`, `0xFFFFFFFFFFFFFFFF`
* **Floating point numbers**
Numbers will be stored in IEEE-754 double-precision format.
Examples: `0.0`, `1.5`, `.5`, `2.`, `1e100`, `1.38e-23`, `6.02e+23`
* **Strings**
Strings must be encoded as UTF-8, and written between single quotes (double-quoted strings are
*not* supported). To represent a single quote in the string, use `''`.
Examples: `'Hello'`, `'10 o''clock'`
### Operators
From highest to lowest precedence:
1. function call, array subscript `x[i]`
2. unary `-`, `+`, `~`
3. `*`, `/`
4. `+`, `-`, `||`
5. `&`
6. `|`, `^`
7. `=`, `<>`, `<`, `>`, `<=`, `>=`, `IS`, `IS NOT`
8. unary `NOT`
9. `AND`
10. `OR`
11. `:=`
12. `;`
* **Division `/`**
The division operator always result in a floating-point number (i.e. `3 / 2 = 1.5`). Use the
`div` function for integer division.
* **Concatenation `||`**
The `||` operator concatenates two strings together. If either side is not a string, they will
first be converted into a string. This operator cannot be used to concatenate arrays.
* **Comparison `=`, `<>`, `<`, `>`, `<=`, `>=`**
These operators will return TRUE, FALSE or NULL. When comparing two values, `dbgen` follows
these rules:
- Comparing with NULL always return NULL.
- Numbers are ordered by values.
- Strings are ordered lexicographically in the UTF-8 binary collation.
- Arrays are ordered lexicographically by their elements.
- Comparing two values with different types (e.g. `'4' < 5`) will abort the program.
* **Identity `IS`, `IS NOT`**
These operators will return TRUE or FALSE. `dbgen` follows these rules:
- `NULL IS NULL` is TRUE.
- Values having different types are not identical (`'4' IS 5` is FALSE).
- Values having the same types compare like the `=` and `<>` operators.
These operators are a generalization of standard SQL's `IS [NOT] {TRUE|FALSE|NULL}` operators.
* **Logical operators `NOT`, `AND`, `OR`**
These operators will first convert the input into a nullable boolean value
(TRUE, FALSE or NULL):
- NULL remains NULL.
- Nonzero numbers become TRUE, `0` and `0.0` becomes FALSE, and `NaN` becomes NULL.
- All other types cannot be converted to a boolean and will abort the program.
The trinary logic operates like this:
| AND | TRUE | NULL | FALSE |
|----------:|:-----:|:-----:|:-----:|
| **TRUE** | TRUE | NULL | FALSE |
| **NULL** | NULL | NULL | FALSE |
| **FALSE** | FALSE | FALSE | FALSE |
| OR | TRUE | NULL | FALSE |
|----------:|:-----:|:-----:|:-----:|
| **TRUE** | TRUE | TRUE | TRUE |
| **NULL** | TRUE | NULL | NULL |
| **FALSE** | TRUE | NULL | FALSE |
| NOT | value |
|----------:|:-----:|
| **TRUE** | FALSE |
| **NULL** | NULL |
| **FALSE** | TRUE |
* **Bitwise operators `&`, `|`, `^`, `~`**
These corresponds to bitwise-AND, -OR, -XOR and -NOT respectively. These
operators only accept integers as input and produce *signed* results (e.g.
`~1 = -2`).
* **Assignment `:=`**
The assignment expression `@ident := f()` would evaluate the RHS `f()` and save into the local
variable `@local`. The same value can later be extracted using `@local`. This can be used to
generate correlated columns, for instance:
```sql
CREATE TABLE _ (
"first" BOOLEAN NOT NULL {{ rand.bool(0.5) }},
"second" BOOLEAN NOT NULL {{ @a := rand.bool(0.5) }},
"third" BOOLEAN NOT NULL {{ @a }}
);
```
The first and second columns are entirely independent, but the second and third column will
always have the same value.
Within the same generated file, a row can use local variables assigned from the previous row.
For instance:
```sql
{{ @prev := 0 }}
CREATE TABLE _ (
"prev" INTEGER NULL {{ @prev }},
"cur" INTEGER NOT NULL {{ @prev := rownum }}
);
```
would produce
```sql
INSERT INTO _ VALUES
(0, 1),
(1, 2),
(2, 3),
…
```
* **Statements `;`**
The syntax `a; b; c` evaluates all 3 expressions in order, but only returns `c`. The results of
`a` and `b` are discarded. The statement separator `;` can only be used
* directly inside `{{ … }}`, and
* as `THEN`/`ELSE` clauses of `CASE WHEN` expressions
### Symbols
* **rownum**: The current row number of the main table. The first row has value 1. The derived rows share the same row
number as the main row.
* **subrownum**: The current number in a derived table. If one row of the main table generates *N* rows in the derived
table, this constant will take values 1, 2, …, *N*.
* **current_timestamp**: The timestamp when `dbgen` was started. This can be overridden using the `--now` parameter.
* **NULL**: The null value.
* **TRUE**: Equals to 1.
* **FALSE**: Equals to 0.
### Random functions
* **rand.regex('[0-9a-z]+', 'i', 100)**
Generates a random string satisfying the regular expression. The second and third parameters are
optional. If provided, they specify respectively the regex flags, and maximum repeat count for
the unbounded repetition operators (`+`, `*` and `{n,}`).
The input string should satisfy the syntax of the Rust regex package. The flags is a string
composed of these letters:
* `x` (ignore whitespace)
* `i` (case insensitive)
* `s` (dot matches new-line)
* `u` (enable Unicode mode)
* `a` (disable Unicode mode)
* `o` (recognize octal escapes)
The flags `m` (multi-line) and `U` (ungreedy) does not affect string generation and are ignored.
* **rand.range(7, 19)**
Generates a random integer uniformly distributed in the half-open interval 7 ≤ *x* < 19.
The length of the range must be less than 264.
* **rand.range_inclusive(8, 35)**
Generates a random integer uniformly distributed in the closed interval 8 ≤ *x* ≤ 35.
The length of the range must be less than 264.
* **rand.uniform(2.4, 7.5)**
Generates a random floating point number uniformly distributed in the half-open interval
2.4 ≤ *x* < 7.5.
* **rand.uniform_inclusive(1.6, 8.4)**
Generates a random floating point number uniformly distributed in the closed interval
1.6 ≤ *x* ≤ 8.4.
* **rand.bool(0.3)**
Generates a random boolean (0 or 1) with probability 0.3 of getting "1". Also known as the
Bernoulli distribution.
* **rand.zipf(26, 0.8)**
Generates a random integer in the closed interval 1 ≤ *x* ≤ 26 using [Zipfian distribution]
with an exponent of 0.8.
With Zipfian distribution, the smallest values will appear more often.
[Zipfian distribution]: https://en.wikipedia.org/wiki/Zipf's_law
* **rand.log_normal(2.0, 3.0)**
Generates a random positive number using the [log-normal distribution]
(log *N*(*µ*, *σ*2)) with *μ* = 2.0 and *σ* = 3.0.
The median of this distribution is exp(*µ*).
[log-normal distribution]: https://en.wikipedia.org/wiki/Log-normal_distribution
* **rand.finite_f32()**, **rand.finite_f64()**
Generates a random finite IEEE-754 binary32 or binary64 floating-point number.
The numbers are uniform in its *bit-pattern* across the entire supported range
(±3.4 × 1038 for `f32`, ±1.8 × 10308 for `f64`)
* **rand.uuid()**
Generates a [version 4 (random) UUID](https://tools.ietf.org/html/rfc4122#section-4.4).
The result is a string in the format `'aaaaaaaa-bbbb-4ccc-9ddd-eeeeeeeeeeee'`.
### Date and Time
* **TIMESTAMP '2016-01-02 15:04:05.999'**
Converts an ISO-8601-formatted string into a timestamp, using the time zone specified by the
`--time-zone` flag. The timestamp is internally stored as UTC.
If a time zone observes DST, there will be some time values which are impossible or ambiguous.
Both of these cases will cause an "invalid timestamp" error.
* **TIMESTAMP WITH TIME ZONE '2016-01-02 15:04:05.999 Asia/Hong_Kong'**
Converts an ISO-8601-formatted string into a timestamp, using the time zone specified inside
the string. The timestamp is internally stored as UTC.
Only names in the `tz` database are recognized. The time zone will **not** be printed together
with the timestamp.
* **INTERVAL 30 MINUTE**
Creates a time interval. The inner expression should evaluate a number (can be negative). Valid
units are:
- MICROSECOND
- MILLISECOND
- SECOND
- MINUTE
- HOUR
- DAY
- WEEK
Intervals can be added to or subtracted from timestamps, and can therefore be used to generate
a random timestamp.
* **rand.u31_timestamp()**
Generates a random timestamp distributed uniformly between 1970-01-01 00:00:01 and
2038-01-19 03:14:07 (UTC). There are exactly 231−1 seconds between these two time.
### Strings
* **substring('ⓘⓝⓟⓤⓣ' FROM 2 FOR 3 USING CHARACTERS)**
Extracts a substring from character 2 with length of 3 characters. "Character" means a Unicode
codepoint here. Following SQL standard, the character position is 1-based, so this function call
returns `'ⓝⓟⓤ'`.
All of `FROM`, `FOR` and `USING` parts are optional. The `FROM` part defaults to 1 (start of
string), and `FOR` part defaults to length of the string, e.g.
```sql
substring('ⓘⓝⓟⓤⓣ' FOR 3) = 'ⓘⓝⓟ';
substring('ⓘⓝⓟⓤⓣ' FROM 3) = 'ⓟⓤⓣ';
```
* **substring('input' FROM 2 FOR 3 USING OCTETS)**
Extracts a substring from byte 2 with length of 3 bytes. Following SQL standard, the byte
position is 1-based, so this function call returns `'npu'`.
Both the `FROM` and `FOR` parts are optional. The `FROM` part defaults to 1 (start of string),
and `FOR` part defaults to length of the string.
* **octet_length('input')**
Computes the byte length of the input string.
* **char_length('ⓘⓝⓟⓤⓣ')**, **character_length('ⓘⓝⓟⓤⓣ')**
Computes the character length of the input string. "Character" means a Unicode codepoint here.
`character_length` is an alias of `char_length`; the two functions are equivalent.
* **overlay('input' PLACING 'replacement' FROM 2 FOR 3 USING CHARACTERS)**
Replaces the substring of `'input'` by the `'replacement'`. The meaning of `FROM`, `FOR` and
`USING` when specified are equivalent to the `substring()` function.
The `FOR` and `USING` parts are optional. The `FOR` part defaults to the length of the
replacement string.
### Numbers
* **greatest(*x*, *y*, *z*)**
Returns the largest of all given values. NULL values are ignored.
* **least(*x*, *y*, *z*)**
Returns the smallest of all given values. NULL values are ignored.
* **round(456.789, 2)**
Rounds the number 456.789 to 2 decimal places (i.e. returns 456.79).
The decimal place argument is optional, and defaults to 0. It can also be negative to round by
powers of 10, e.g. `round(456.789, -2) = 500.0`. In case of break-even (e.g. `round(3.5)`), this
function will round half away from zero.
* **div(9, 4)**, **mod(9, 4)**
Computes the quotient and remainder respectively when 9 is divided by 4 (i.e. 2). These two
functions are related by
```sql
n = div(n, d)*d + mod(n, d)
```
The `div(n, d)` function is equivalent to `n / d` truncated towards 0.
The result of `mod(n, d)` has the same sign as the numerator `n`.
When the denominator `d` is 0, both of these functions return NULL.
* **mod(9, 4)**
Computes the remainder when 9 is divided by 4 (i.e. 1). The result has the same sign as the
numerator (+9).
### Arrays
* **ARRAY['X', 'Y', 'Z']**
Constructs an array with content 'X', 'Y', 'Z'.
* ***arr*[3]**
Extracts the 3rd element from the array *arr*. Following the SQL standard, the index is 1-based,
i.e. *arr*[1] returns the first element. Returns NULL if the index is out of range of the array.
* **generate_series(11, 31, 5)**
Generates an array of value sequence `array[11, 16, 21, 26, 31]`. Both start and end points are
inclusive.
The step ("5" here) can be omitted and defaults to 1. It can also be negative to generate a
decreasing sequence.
```sql
generate_series(31, 11, -5) = array[31, 26, 21, 16, 11]
```
The sequence will not go beyond the end point.
```sql
generate_series(11, 30, 5) = array[11, 16, 21, 26]
generate_series(30, 11, -5) = array[30, 25, 20, 15]
```
* **rand.shuffle(*arr*)**
Returns a new array by shuffling *arr*.
### Debugging
* **debug.panic('X', 'Y', 'Z')**
Causes the generation to fail with error when this function is executed. The input to
`debug.panic()` will be evaluated and printed out, in the form like:
```
Error: runtime panic:
1. X
2. Y
3. Z
--> 99:3
|
99 | {{ debug.panic('X', 'Y', 'Z') }}␊
| ^-------------------------^
|
=
```
### Miscellaneous
* **CASE *value* WHEN *p1* THEN *r1* WHEN *p2* THEN *r2* ELSE *ro* END**
Equivalent to the SQL simple `CASE … WHEN` expression.
If *value* equals to *p1* (i.e. `(value = p1) IS TRUE`), then the expression's value is *r1*,
etc. If the *value* does not equal to any of the listed pattern, the value *ro* will be
returned. If the ELSE branch is missing, returns NULL.
* **CASE WHEN *p1* THEN *r1* WHEN *p2* THEN *r2* ELSE *ro* END**
Equivalent to the SQL searched `CASE WHEN` expression.
If *p1* is true, then the expression's value is *r1*, etc. If all of the listed conditions are
false or NULL, the value *ro* will be returned. If the ELSE branch is missing, returns NULL.
* **coalesce(*v1*, *v2*, *v3*)**
Returns the first non-NULL value. If all of *v1*, *v2*, *v3* are NULL, returns NULL.
Note that `coalesce` is treated as a normal function, unlike standard SQL, and all arguments are
evaluated before checking for nullability. Prefer `CASE WHEN` expression if you need to control
the evaluation side-effect.
* **@local**
Gets the previous assigned local variable. If the variable was undefined, this will return NULL.