Template reference ================== File syntax ----------- The template file should consist of one CREATE TABLE statement, with expressions telling how a value should be generated inside `{{ … }}` or `/*{{ … }}*/` blocks, e.g. ```sql CREATE TABLE "database"."schema"."table" ( "id" INTEGER, /*{{ rownum }}*/ "name" CHAR(40), /*{{ rand.regex('[a-zA-Z ]{40}') }}*/ UNIQUE KEY "some_index"("id") ); ``` Each `{{ … }}` block appearing alongside the column definitions represent one value in every generated row. The example above may produce output like ```sql INSERT INTO "table" VALUES (1, 'cHvkcFSMq YbERjjeBUzLaOG TOYvDrHhfymyQeP'), (2, 'kBAQlctdPLAlYZPyvYoBRIhYiEtOONCZQVpxpbbw'), (3, 'ILqyqKYi jTqXaUjsgdFqYxFasnUMzXaFRvqJDdx'), … ``` See [Advanced template features](./TemplateAdvanced.md) for more syntactical features. Expression syntax ----------------- `dbgen` supports an SQL-like expression syntax. These expressions will be re-evaluated for every new row generated. ### Literals `dbgen` supports numbers and string literals. * **Integers** Decimal and hexadecimal numbers are supported. The value must be between 0 and 264 − 1. Examples: `0`, `3`, `18446744073709551615`, `0X1234abcd`, `0xFFFFFFFFFFFFFFFF` * **Floating point numbers** Numbers will be stored in IEEE-754 double-precision format. Examples: `0.0`, `1.5`, `.5`, `2.`, `1e100`, `1.38e-23`, `6.02e+23` * **Strings** Strings must be encoded as UTF-8, and written between single quotes (double-quoted strings are *not* supported). To represent a single quote in the string, use `''`. Examples: `'Hello'`, `'10 o''clock'` ### Operators From highest to lowest precedence: 1. function call, array subscript `x[i]` 2. unary `-`, `+`, `~` 3. `*`, `/` 4. `+`, `-`, `||` 5. `&` 6. `|`, `^` 7. `=`, `<>`, `<`, `>`, `<=`, `>=`, `IS`, `IS NOT` 8. unary `NOT` 9. `AND` 10. `OR` 11. `:=` 12. `;` * **Division `/`** The division operator always result in a floating-point number (i.e. `3 / 2 = 1.5`). Use the `div` function for integer division. * **Concatenation `||`** The `||` operator concatenates two strings together. If either side is not a string, they will first be converted into a string. This operator cannot be used to concatenate arrays. * **Comparison `=`, `<>`, `<`, `>`, `<=`, `>=`** These operators will return TRUE, FALSE or NULL. When comparing two values, `dbgen` follows these rules: - Comparing with NULL always return NULL. - Numbers are ordered by values. - Strings are ordered lexicographically in the UTF-8 binary collation. - Arrays are ordered lexicographically by their elements. - Comparing two values with different types (e.g. `'4' < 5`) will abort the program. * **Identity `IS`, `IS NOT`** These operators will return TRUE or FALSE. `dbgen` follows these rules: - `NULL IS NULL` is TRUE. - Values having different types are not identical (`'4' IS 5` is FALSE). - Values having the same types compare like the `=` and `<>` operators. These operators are a generalization of standard SQL's `IS [NOT] {TRUE|FALSE|NULL}` operators. * **Logical operators `NOT`, `AND`, `OR`** These operators will first convert the input into a nullable boolean value (TRUE, FALSE or NULL): - NULL remains NULL. - Nonzero numbers become TRUE, `0` and `0.0` becomes FALSE, and `NaN` becomes NULL. - All other types cannot be converted to a boolean and will abort the program. The trinary logic operates like this: | AND | TRUE | NULL | FALSE | |----------:|:-----:|:-----:|:-----:| | **TRUE** | TRUE | NULL | FALSE | | **NULL** | NULL | NULL | FALSE | | **FALSE** | FALSE | FALSE | FALSE | | OR | TRUE | NULL | FALSE | |----------:|:-----:|:-----:|:-----:| | **TRUE** | TRUE | TRUE | TRUE | | **NULL** | TRUE | NULL | NULL | | **FALSE** | TRUE | NULL | FALSE | | NOT | value | |----------:|:-----:| | **TRUE** | FALSE | | **NULL** | NULL | | **FALSE** | TRUE | * **Bitwise operators `&`, `|`, `^`, `~`** These corresponds to bitwise-AND, -OR, -XOR and -NOT respectively. These operators only accept integers as input and produce *signed* results (e.g. `~1 = -2`). * **Assignment `:=`** The assignment expression `@ident := f()` would evaluate the RHS `f()` and save into the local variable `@local`. The same value can later be extracted using `@local`. This can be used to generate correlated columns, for instance: ```sql CREATE TABLE _ ( "first" BOOLEAN NOT NULL {{ rand.bool(0.5) }}, "second" BOOLEAN NOT NULL {{ @a := rand.bool(0.5) }}, "third" BOOLEAN NOT NULL {{ @a }} ); ``` The first and second columns are entirely independent, but the second and third column will always have the same value. Within the same generated file, a row can use local variables assigned from the previous row. For instance: ```sql {{ @prev := 0 }} CREATE TABLE _ ( "prev" INTEGER NULL {{ @prev }}, "cur" INTEGER NOT NULL {{ @prev := rownum }} ); ``` would produce ```sql INSERT INTO _ VALUES (0, 1), (1, 2), (2, 3), … ``` * **Statements `;`** The syntax `a; b; c` evaluates all 3 expressions in order, but only returns `c`. The results of `a` and `b` are discarded. The statement separator `;` can only be used * directly inside `{{ … }}`, and * as `THEN`/`ELSE` clauses of `CASE WHEN` expressions ### Symbols * **rownum**: The current row number of the main table. The first row has value 1. The derived rows share the same row number as the main row. * **subrownum**: The current number in a derived table. If one row of the main table generates *N* rows in the derived table, this constant will take values 1, 2, …, *N*. * **current_timestamp**: The timestamp when `dbgen` was started. This can be overridden using the `--now` parameter. * **NULL**: The null value. * **TRUE**: Equals to 1. * **FALSE**: Equals to 0. ### Random functions * **rand.regex('[0-9a-z]+', 'i', 100)** Generates a random string satisfying the regular expression. The second and third parameters are optional. If provided, they specify respectively the regex flags, and maximum repeat count for the unbounded repetition operators (`+`, `*` and `{n,}`). The input string should satisfy the syntax of the Rust regex package. The flags is a string composed of these letters: * `x` (ignore whitespace) * `i` (case insensitive) * `s` (dot matches new-line) * `u` (enable Unicode mode) * `a` (disable Unicode mode) * `o` (recognize octal escapes) The flags `m` (multi-line) and `U` (ungreedy) does not affect string generation and are ignored. * **rand.range(7, 19)** Generates a random integer uniformly distributed in the half-open interval 7 ≤ *x* < 19. The length of the range must be less than 264. * **rand.range_inclusive(8, 35)** Generates a random integer uniformly distributed in the closed interval 8 ≤ *x* ≤ 35. The length of the range must be less than 264. * **rand.uniform(2.4, 7.5)** Generates a random floating point number uniformly distributed in the half-open interval 2.4 ≤ *x* < 7.5. * **rand.uniform_inclusive(1.6, 8.4)** Generates a random floating point number uniformly distributed in the closed interval 1.6 ≤ *x* ≤ 8.4. * **rand.bool(0.3)** Generates a random boolean (0 or 1) with probability 0.3 of getting "1". Also known as the Bernoulli distribution. * **rand.zipf(26, 0.8)** Generates a random integer in the closed interval 1 ≤ *x* ≤ 26 using [Zipfian distribution] with an exponent of 0.8. With Zipfian distribution, the smallest values will appear more often. [Zipfian distribution]: https://en.wikipedia.org/wiki/Zipf's_law * **rand.log_normal(2.0, 3.0)** Generates a random positive number using the [log-normal distribution] (log *N*(*µ*, *σ*2)) with *μ* = 2.0 and *σ* = 3.0. The median of this distribution is exp(*µ*). [log-normal distribution]: https://en.wikipedia.org/wiki/Log-normal_distribution * **rand.finite_f32()**, **rand.finite_f64()** Generates a random finite IEEE-754 binary32 or binary64 floating-point number. The numbers are uniform in its *bit-pattern* across the entire supported range (±3.4 × 1038 for `f32`, ±1.8 × 10308 for `f64`) * **rand.uuid()** Generates a [version 4 (random) UUID](https://tools.ietf.org/html/rfc4122#section-4.4). The result is a string in the format `'aaaaaaaa-bbbb-4ccc-9ddd-eeeeeeeeeeee'`. ### Date and Time * **TIMESTAMP '2016-01-02 15:04:05.999'** Converts an ISO-8601-formatted string into a timestamp, using the time zone specified by the `--time-zone` flag. The timestamp is internally stored as UTC. If a time zone observes DST, there will be some time values which are impossible or ambiguous. Both of these cases will cause an "invalid timestamp" error. * **TIMESTAMP WITH TIME ZONE '2016-01-02 15:04:05.999 Asia/Hong_Kong'** Converts an ISO-8601-formatted string into a timestamp, using the time zone specified inside the string. The timestamp is internally stored as UTC. Only names in the `tz` database are recognized. The time zone will **not** be printed together with the timestamp. * **INTERVAL 30 MINUTE** Creates a time interval. The inner expression should evaluate a number (can be negative). Valid units are: - MICROSECOND - MILLISECOND - SECOND - MINUTE - HOUR - DAY - WEEK Intervals can be added to or subtracted from timestamps, and can therefore be used to generate a random timestamp. * **rand.u31_timestamp()** Generates a random timestamp distributed uniformly between 1970-01-01 00:00:01 and 2038-01-19 03:14:07 (UTC). There are exactly 231−1 seconds between these two time. ### Strings * **substring('ⓘⓝⓟⓤⓣ' FROM 2 FOR 3 USING CHARACTERS)** Extracts a substring from character 2 with length of 3 characters. "Character" means a Unicode codepoint here. Following SQL standard, the character position is 1-based, so this function call returns `'ⓝⓟⓤ'`. All of `FROM`, `FOR` and `USING` parts are optional. The `FROM` part defaults to 1 (start of string), and `FOR` part defaults to length of the string, e.g. ```sql substring('ⓘⓝⓟⓤⓣ' FOR 3) = 'ⓘⓝⓟ'; substring('ⓘⓝⓟⓤⓣ' FROM 3) = 'ⓟⓤⓣ'; ``` * **substring('input' FROM 2 FOR 3 USING OCTETS)** Extracts a substring from byte 2 with length of 3 bytes. Following SQL standard, the byte position is 1-based, so this function call returns `'npu'`. Both the `FROM` and `FOR` parts are optional. The `FROM` part defaults to 1 (start of string), and `FOR` part defaults to length of the string. * **octet_length('input')** Computes the byte length of the input string. * **char_length('ⓘⓝⓟⓤⓣ')**, **character_length('ⓘⓝⓟⓤⓣ')** Computes the character length of the input string. "Character" means a Unicode codepoint here. `character_length` is an alias of `char_length`; the two functions are equivalent. * **overlay('input' PLACING 'replacement' FROM 2 FOR 3 USING CHARACTERS)** Replaces the substring of `'input'` by the `'replacement'`. The meaning of `FROM`, `FOR` and `USING` when specified are equivalent to the `substring()` function. The `FOR` and `USING` parts are optional. The `FOR` part defaults to the length of the replacement string. ### Numbers * **greatest(*x*, *y*, *z*)** Returns the largest of all given values. NULL values are ignored. * **least(*x*, *y*, *z*)** Returns the smallest of all given values. NULL values are ignored. * **round(456.789, 2)** Rounds the number 456.789 to 2 decimal places (i.e. returns 456.79). The decimal place argument is optional, and defaults to 0. It can also be negative to round by powers of 10, e.g. `round(456.789, -2) = 500.0`. In case of break-even (e.g. `round(3.5)`), this function will round half away from zero. * **div(9, 4)**, **mod(9, 4)** Computes the quotient and remainder respectively when 9 is divided by 4 (i.e. 2). These two functions are related by ```sql n = div(n, d)*d + mod(n, d) ``` The `div(n, d)` function is equivalent to `n / d` truncated towards 0. The result of `mod(n, d)` has the same sign as the numerator `n`. When the denominator `d` is 0, both of these functions return NULL. * **mod(9, 4)** Computes the remainder when 9 is divided by 4 (i.e. 1). The result has the same sign as the numerator (+9). ### Arrays * **ARRAY['X', 'Y', 'Z']** Constructs an array with content 'X', 'Y', 'Z'. * ***arr*[3]** Extracts the 3rd element from the array *arr*. Following the SQL standard, the index is 1-based, i.e. *arr*[1] returns the first element. Returns NULL if the index is out of range of the array. * **generate_series(11, 31, 5)** Generates an array of value sequence `array[11, 16, 21, 26, 31]`. Both start and end points are inclusive. The step ("5" here) can be omitted and defaults to 1. It can also be negative to generate a decreasing sequence. ```sql generate_series(31, 11, -5) = array[31, 26, 21, 16, 11] ``` The sequence will not go beyond the end point. ```sql generate_series(11, 30, 5) = array[11, 16, 21, 26] generate_series(30, 11, -5) = array[30, 25, 20, 15] ``` * **rand.shuffle(*arr*)** Returns a new array by shuffling *arr*. ### Debugging * **debug.panic('X', 'Y', 'Z')** Causes the generation to fail with error when this function is executed. The input to `debug.panic()` will be evaluated and printed out, in the form like: ``` Error: runtime panic: 1. X 2. Y 3. Z --> 99:3 | 99 | {{ debug.panic('X', 'Y', 'Z') }}␊ | ^-------------------------^ | = ``` ### Miscellaneous * **CASE *value* WHEN *p1* THEN *r1* WHEN *p2* THEN *r2* ELSE *ro* END** Equivalent to the SQL simple `CASE … WHEN` expression. If *value* equals to *p1* (i.e. `(value = p1) IS TRUE`), then the expression's value is *r1*, etc. If the *value* does not equal to any of the listed pattern, the value *ro* will be returned. If the ELSE branch is missing, returns NULL. * **CASE WHEN *p1* THEN *r1* WHEN *p2* THEN *r2* ELSE *ro* END** Equivalent to the SQL searched `CASE WHEN` expression. If *p1* is true, then the expression's value is *r1*, etc. If all of the listed conditions are false or NULL, the value *ro* will be returned. If the ELSE branch is missing, returns NULL. * **coalesce(*v1*, *v2*, *v3*)** Returns the first non-NULL value. If all of *v1*, *v2*, *v3* are NULL, returns NULL. Note that `coalesce` is treated as a normal function, unlike standard SQL, and all arguments are evaluated before checking for nullability. Prefer `CASE WHEN` expression if you need to control the evaluation side-effect. * **@local** Gets the previous assigned local variable. If the variable was undefined, this will return NULL.