Compiler Internals ================== This document describes how the compiler works and how its functionality can be extended. It is aimed at developers integrating projectM-EvalLib into their own applications and looking for adding functions and new features on top of the supported set of Milkdrop-compatible expression syntax. ## Scanner The scanner (or lexer) is responsible for tokenizing the input data stream and return the type and possible content of these tokens to the parser. This library uses GNU Flex to generate ode from a rules file, called `Scanner.l`. The scanner looks for these kinds of symbols, with letters always being case-insensitive: - Inline comments, starting with `//` until end of the current line - Block comments, enclosed in `/* ... */`, which can appear anywhere and also span multiple lines. - HEX constants in the form `$Xnn...`, scanned until any non-hexadecimal character is encountered (`[0-9a-f]`) - ORD constants in the form `$'c'` where `c` is any single character. - Math constants `$PI`, `$E` and `$PHI`. - The `gmem` identifier, as it's used in the non-function form `gmem[index]`, requiring special treatment in the grammar. - Numbers in integer and floating-point notation with an optional exponent. Note that these numbers are always positive, negativity is applied later via the `-` unary operator in the parser. - Identifiers starting with an underscore or alphabetical letter, and also numbers on any following character. These can be returned as two different tokens: - If the (lower-case) name exists in the function table, the name is returned as a `FUN` token. - Anything else is returned as a `VAR` token and identifies a variable. ## Grammar The grammar is defined in Bison (yacc) syntax and specifies the order in which tokens can appear and how they are transformed into a tree structure. In the grammar, different syntactical features are specified: - Tokens returned by the scanner and their associated C data types in the parser. - Operator precedence and associativity: Determines which operators are evaluated before others, and in which direction (left-to-right or right-to-left). - The recursively defined grammar, consisting of a list of tokens per rule and an action. Bison creates a state machine from the grammar, analyze any issues like potential ambiguities and take care of the proper shift/reduce operations. Bison also provides error handling. ### Compilation Errors In this grammar, any syntax error encountered will fail the compilation of the whole program. __This is a difference in comparison to the Milkdrop parser__, which will not fail compilation on certain circumstances, silently ignoring errors in the code. Reproducing the exact same behaviour would require a completely different approach in the grammar, scanner and parser. ## Expression Tree ### Node Object All grammar as described in the previous sections is scanned and compiled into a tree structure by the parser actions. Each node is a `prjm_eval_exptreenode` struct, which contains: - A function pointer `func`, which is determined the behaviour of this node. - A fixed float value `value` to store constant numbers or temporary values. - A union of either `var`, pointing to a variable storage location, or `memory_buffer` which is a pointer to a `projectm_eval_mem_buffer` with (g)megabuf data. - An array of pointers `args`, pointing to the argument node objects of the function. - A linked-list pointer `list` which stores an instruction list to be executed by loops or the special `/*list*/`" function. In general, only a few different "objects" will be created as tree nodes: - Constant: Return simple read-only float-typed numbers. Uses the `value` member to store the constant. - Variable: Return a read/write variable reference. USes the `var` member to store the variable storage location. - Memory access function: Operates on the memory buffer specified in `memory_buffer`, either retrieving or setting data. - Function: Executes objects in `args` and/or the expressions in `list`, then eventually applies an operation on the result and sets the return value to either a constant value or a variable reference. ### Node Function The function in each node object has a `void` return value and takes two parameters: - The execution context `ctx`, which is a pointer to the `prjm_eval_exptreenode` object the function will be executed on. Think of it as the C++ `this` pointers. - A pointer to a float pointer `ret_val`, which is the return value of the function. Every node function _must_ return either a value or a value reference. By convention, this is either the result of the operation the function carried out, or for functions using instruction lists, the result of the _last executed instruction_. Depending on the function, the result can either be a simple constant value or a reference to a variable. When calling a math function like `sin()`, implementing a math operator or returning a constant, the function must assign the value to the value pointed to by `*ret_val`. Use the `assign_ret_val()` macro to assign the return value: ```c assign_ret_val(sin(x)); ``` Some functions will not carry out an actual math operation, but return the result of an expression, which may be a variable reference, which may be used as the first argument of the internal `_set` function (or the AVS-specific `assign` function) to assign it the result of another expression. A reference can be assigned using the `assign_ret_ref` macro, example taken from the `prjm_eval_func_var` function: ```c assign_ret_ref(ctx->var); ``` As with all pointers, an assigned pointer must stay valid at least until the end of the program's execution. It is recommended to only use the following memory references: - The `var` pointer in variable access functions. - The address of the `value` member of the current `prjm_eval_exptreenode` object in which context the function runs. Do not assign pointers to locally defined variables. Passing `ret_val` as the return value to a function argument and let the function set it to the desired result is also viable: ```C prjm_eval_function_decl(myfunc) { assert_valid_ctx(); invoke_arg(0, ret_val); } ``` The `if` function evaluates its first argument, and depending on it being a true or false value, returns either the second (`true`) or the third (`false`) argument. Both arguments may return a variable reference, so the `if` functions needs to take care of this by simply calling the appropriate function and passing the current `ret_val` to it. This way, whatever the sub-expression assign to it will be returned by the `if` function. ### Temporary Compiler Objects #### Node Object Any tree node object created in a parser action will be wrapped inside a temporary compiler node of type `prjm_eval_compiler_node`, which stores a few additional flags only required inside the parser's reduce actions: - The type of the current parser node, either a single expression or an expression list. - Constness and state-changing flags of the expression (last expression in a list). - combined constness and state-changing flags of the expression list. Same as the previous one if only a single expression is stored in the compiler node. #### Function Argument List Inside a function argument list, all expressions are collected in a special `prjm_eval_compiler_arg_list` wrapper holding one compiler node object for each potential function argument. It also keeps the argument count as a separate variable for easy access. When the arguments have been collected and the function is reduced in the parser, the action will then compare the actual argument count against the count expected by the function. If the numbers don't match, a parse error is thrown. ### Optimizations The parser does perform two different optimizations during compile time to save execution time: #### Compile-time Evaluable Functions The compiler optimizes the generated code by evaluating constant expressions at compile time. These expressions purely consist of functions which both only have constant-evaluable arguments and have a fixed, deterministic behaviour. Examples of such functions, given their arguments are also const-evaluable: - Functions which return compile-time constants. - Non-assigment Operators. - Math functions like `sin`. Non-constant functions are functions which depend on variables or other memory contents which may change during or between executions or produce a non-deterministic result on each execution. Examples are: - Functions which have at least one non-const-evaluable argument. - Variable and memory access functions. - Random number generators. During compilation, when a new node is created from a subtree, each argument and the new node are checked for being const-evaluable. If all are, then the new node is immediately evaluated by calling the function, and a new constant node is inserted with the calculated value, replacing both the currently processed node and its argument nodes. #### Expression List Optimization In function arguments and inside parentheses, expression lists, e.g. multiple expressions separated by semicolons, can be written. This is a useful feature to execute a list of expressions based on certain conditions (e.g. in `if` or `loop`). While only the _last_ expression in a list determines the return value of the whole list, other expressions may perform calculations, set variables etc. and also need to be executed in order to achieve the correct result. There could also be expressions in the list which do not assign anything, and which are not the last expressions. These expressions are referred to as "non state-changing", and only waste CPU cycles because the result of any calculation done there is discarded. The expression compiler will keep track of the state-changing flag hierarchically, similar to the const flag, and remove any expressions which are nether state-changing nor relevant to the return value. In the case that a state-changing operator, e.g. an assignment, has a constant value as the left operand, it will be replaced by the const-evaluation optimization and lose its state-changing flag. Consider the following expression, where the first two expressions will be optimized out: ``` sin(1 + 2; 3); tan(5) = x; 5 = rand(100); x = cos($PI * .54); 5; ``` The third expression cannot be optimized out, because the assignment operator is potentially state-changing if it assigns to a variable or memory location and the `rand` function is not const-evaluable. The fourth expression is very similar, but this time, the use of a variable prevents the expression form being const-evaluable. The fifth and last expression is a simple constant and determines the return value of the whole expression list.