arXiv:2312.04282v1 [cs.DB] 7 Dec 2023

Adaptive Recursive Query Optimization

Anna Herlihy (EPFL, Switzerland, anna.herlihy@epfl.ch), Guillaume Martres (EPFL, Switzerland, guillaume.martres@epfl.ch), Anastasia Ailamaki* (EPFL, Google, Switzerland, anastasia.ailamaki@epfl.ch), Martin Odersky (EPFL, Switzerland, martin.odersky@epfl.ch)

* Work done in its entirety at EPFL.

Abstract—Performance-critical industrial applications, including large-scale program, network, and distributed system analyses, are increasingly reliant on recursive queries for data analysis. Yet traditional relational algebra-based query optimization techniques do not scale well to recursive query processing due to the iterative nature of query evaluation, where relation cardinalities can change unpredictably during the course of a single query execution. To avoid error-prone cardinality estimation, adaptive query processing techniques use runtime information to inform query optimization, but these systems are not optimized for the specific needs of recursive query processing. In this paper, we introduce Adaptive Metaprogramming, an innovative technique that shifts recursive query optimization and code generation from compile-time to runtime using principled metaprogramming, enabling dynamic optimization and re-optimization before and after query execution has begun. We present a custom join-ordering optimization applicable at multiple stages during query compilation and execution. Through Carac, we evaluate the optimization potential of Adaptive Metaprogramming and show that unoptimized recursive query execution time can be improved by three orders of magnitude and hand-optimized queries by 4x.

I. INTRODUCTION

The advanced algorithms powering contemporary applications require increasingly expressive query capabilities beyond traditional relational algebra, for example iterative or fixed-point computations. This demand has led to a resurgence in the use of recursion in analytical queries, yet the development of recursive query optimization techniques lags behind those established for non-recursive queries. Support for recursion was added to the ANSI’99 SQL standard, and yet most relational data management systems are not optimized for the specific needs of recursive queries: during the execution of a single recursive query, resulting relations from the current iteration are fed as input into the next iteration of query evaluation. As a result, the relation cardinalities can change unpredictably and the number of iterations required to execute the query is not known ahead of time. Existing cardinality estimation techniques assume input relation cardinalities do not change during execution, so they cannot guarantee high-quality join orders after the initial iteration.

Instead, developers turn to Datalog for high-performance recursive query processing. Datalog is a popular query language for recursive queries that has been used for Java program analysis [1], TensorFlow [2], the Ethereum Virtual Machine [3], AWS network analysis [4], and other performance-critical applications. Most Datalog execution engines follow a bottom-up evaluation [5] strategy, in which subqueries in the form of unions of conjunctive queries are repeatedly applied to relations. State-of-the-art Datalog execution engines provide manual or offline join-order optimizations for these subqueries, but require a profiling stage and cannot adapt if the real-time data does not match what was profiled.

To avoid error-prone cardinality estimation [6], adaptive query processing (AQP) uses runtime information to tune query optimization [7]. This can be combined with code-generating data management systems [8]–[10] that improve query execution performance by avoiding interpretation overhead, partially evaluating query plans into compiled executables. However, code-generating AQP systems do not regenerate and compile code at runtime, instead jumping between precompiled plans or code blocks to avoid the prohibitively high cost of invoking an optimizing compiler during query execution [11], [12]. Because recursive queries require repeated execution of subqueries over multiple iterations, the number of potential paths is prohibitively large to enumerate or compile ahead-of-time. Thus, there is a need for query optimization techniques that can leverage runtime information to meet the specific optimization criteria of recursive query processing.

To address these challenges, this work introduces Adaptive Metaprogramming, a novel technique for runtime optimization and repeated re-optimization of recursive queries. We implement and evaluate the design in Carac, a system that tightly integrates a specialized Datalog execution engine with a general-purpose programming language and leverages Multi-Stage Programming [13], a known technique from the Programming Languages community, to support runtime code generation. This paper makes the following contributions:

• We introduce Adaptive Metaprogramming, a technique that combines the power of code-generating just-in-time data management systems with Multi-Stage Programming to enable adaptive, continuous query re-optimization by regenerating code at runtime.
• We leverage Adaptive Metaprogramming to enable a custom optimization for Datalog queries that selects efficient join orders for conjunctive subqueries at different stages in the lifetime of the query: from ahead-of-time, when only queries are available, to just-in-time, when runtime statistics like relation cardinalities are available.
• Through Carac, we show how Adaptive Metaprogramming can speed up unoptimized Datalog queries over 1100x using only compile-time optimizations. Using only runtime optimizations, Carac can speed up unoptimized queries by over 2400x while maintaining a 4x speedup over manually hand-optimized queries.

Fig. 1. Stages at which information becomes available. At Carac compile time, most fact and rule schemas and some facts and queries are known, so Carac can generate indexes and specialize data structures and the interpreted query processor. At query compile time, all fact schemas, all queries, and all input data are known, so Carac can begin profiling and leverage incremental compilation. At query run time, all facts, queries, and input data are known, plus slow operations to target, cardinalities, and any other runtime information, so Carac can optimize and deoptimize.

II. BACKGROUND

A. Datalog

Datalog [14] is a rule-based declarative language based in logic programming. Datalog programs take the form of a finite set of facts and rules. Facts are stored in the extensional database (EDB) and can be considered the “input data”. Rules are stored in the intensional database (IDB) and define how new facts can be derived from existing facts. A typical Datalog query for the transitive closure of nodes in a graph:

  path(x, y) :- edge(x, y)
  path(x, z) :- edge(x, y), path(y, z)

is read as “a path from variable x to y exists if an edge exists from x to y; and a path exists from x to z if there is an edge from x to a midpoint y, and a path from y to z”. Each clause within a rule is called an atom, e.g., edge(x, y). Note that the rule path is recursively defined, since a predicate in the head (the left-hand side) of the rule also appears in its body (the right-hand side).

The bottom-up evaluation of a Datalog program iteratively computes subqueries containing unions of conjunctive queries for each derived predicate symbol until an iteration passes in which no new facts are discovered [14]. Evaluation can be made more efficient using the Semi-Naive evaluation technique, where only facts from the previous iteration are used to discover new facts [15].

B. Metaprogramming

Metaprogramming is a programming technique in which programs can treat other programs as their data, sometimes as the input (e.g., analytical macros) or as the output (e.g., code generation). Multi-Stage Programming (MSP) [13] is a variant of metaprogramming in which compilation happens in a series of phases, starting at compile time and potentially continuing into multiple stages at runtime. The MSP literature breaks staging into two fundamental operations, quotes and splices [13]. Quotes delay evaluation of an expression to a later phase, generally referred to as staging an expression. Splices compose and evaluate expressions, generating either expressions or another quote. MSP is both generative, meaning it can generate code, and analytical, meaning code can be analyzed and decomposed into sub-components.

MSP enables code to be specialized at the level of abstraction appropriate for the optimization, but it is also principled in that it provides safety guarantees. Unlike many code-generation frameworks that concatenate strings or inject instructions at runtime, it is not possible to generate code at runtime that is unsound. Staged code is parsed, type-checked, and partially compiled into an intermediate representation by the language compiler during the initial program compilation. MSP also enables deoptimization, necessary when the assumptions an optimization was based upon have changed and the optimization may no longer show speedup, by easily deferring control flow between generated and existing code.

III. ADAPTIVE METAPROGRAMMING ARCHITECTURE

The key to Adaptive Metaprogramming is a staged approach: the Datalog program passes through a series of partial evaluations that continue past the start of the program runtime, re-generating code as needed. The fundamental performance benefit of code generation is specialization, which avoids as many expensive data-dependent decisions as possible while applying the best execution strategy given the available information. Without the ability to make runtime decisions, query optimizers must rely on sometimes inaccurate estimations for cardinalities or cost models [6], employ an entirely separate profiling or “learning” stage, or miss out on certain optimizations entirely because the runtime cost is too high. Staging also allows for gradual specialization, so that optimization decisions can be made, or previous ones revised, as soon as new information becomes available.

For example, one of the “killer” applications of Datalog has been program analysis [14], [16]–[19]. Static analyses that can take hundreds of lines of code to express in traditional imperative programming languages can be expressed in just a few lines of Datalog [18]. The nature of program analysis queries is that the information required to execute and optimize efficiently is not available at a single point in time. The diagram in Figure 1 shows potential stages within the lifecycle of a Datalog query and at which stage information becomes available.

As an example, consider an analytical data management system that ingests user-defined functions (UDFs) and performs static analysis to identify optimization opportunities [20]. The analytical system will have predefined what analyses it will perform, so out-of-the-box it will have access to the schema of analysis queries, i.e. the Datalog rule schemas. It would also likely predefine what programming languages it accepts, so it will have access to input data (i.e. fact) schemas, plus any other built-in facts representing invariants. If the UDF is registered with the system ahead-of-time, then the facts generated from the UDF code will become available for analysis. Then, once the developer wants to deploy the UDF in an analytical pipeline, knowledge of the adjacent operators becomes available, so only then can inter-operator analysis queries execute. For subsequent runs of that analysis, the optimizer should be able to use relevant information from the previous runs. Alternatively, if the UDF is not registered ahead-of-time, then only the rule schemas will be available to inform query optimization ahead-of-time. Further, if the analytical system ingests custom analyses, then neither the facts nor the rule schemas will be available before query execution begins, and runtime optimization is the only option. Staging, via phases of compile-time and runtime code generation, allows the optimizer the freedom to leverage concrete information as it becomes available.

Relation cardinalities in a recursive query are particularly hard to predict ahead of time: the result of the previous loop is fed into the next iteration, and even within the same iteration we may be 10 relations deep into an n-way join, where predicting the size of the join sub-tree gets very complex very quickly. To avoid the complexity and inaccuracy of cardinality estimation [6], Carac reads the cardinality of relations at runtime and uses that information to inform the execution plan. The ideal execution order with respect to cardinality would be to sort the input relations to the join such that the smallest possible relation is always the outer relation, as Carac uses nested-loop joins. Yet relation cardinalities are not always available, for example in ahead-of-time optimization, so the reordering algorithm takes into account an additional heuristic to predict the selectivity of the join conditions.
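Such a connectedness heuristic can be sketched in a few lines of Scala. The Atom type and the lowercase-variable convention below are illustrative stand-ins, not Carac's internal representation:

```scala
// Toy representation of a rule atom; NOT Carac's internal types.
case class Atom(relation: String, terms: List[String])

// Treat lowercase identifiers as variables, anything else as a constant.
def variables(a: Atom): Set[String] =
  a.terms.filter(t => t.nonEmpty && t.head.isLower).toSet

// Connectedness: the number of variables two atoms share. More shared
// variables suggests a more selective join condition, and therefore a
// smaller intermediate relation.
def connectedness(a: Atom, b: Atom): Int =
  variables(a).intersect(variables(b)).size
```

For instance, edge(x, y) and path(y, z) share one variable, while two atoms over disjoint variables have connectedness zero and give no join condition at all.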
Selectivity is approximated using atom connectedness: the more variables two atoms have in common, the more selective the join condition and the smaller the resulting intermediate relation is likely to be. Constant filters are always pushed before the join, and in Carac, views allow for late materialization and pipelining.

Carac’s atom reordering algorithm is as follows. To select an order of atoms given a rule with n atoms of varying cardinalities, including one delta relation, Carac first sorts the atoms by the cardinalities of their relations and puts them into a stack. The first atom is popped off the stack (as it has the smallest cardinality) and chosen as the leftmost atom. Carac then identifies the atom with the most selective join condition with respect to the chosen atom. The next most-selective, smallest-cardinality atom is continually added with respect to the most recently added atom. If Carac finds an atom for which no other atoms have shared join conditions, it returns to the stack, pops off the atom with the next-smallest cardinality, and repeats the algorithm until no more atoms are left on the stack.

The optimization can first be applied as soon as the Datalog rule schemas are available. This can be at Carac compile-time, implemented with compile-time generative macros. Depending on what the user supplies, at compile-time the algorithm may not yet have access to the relation cardinalities, as some facts have not yet been added to the system, so the join ordering will rely on the available facts and the selectivity heuristic. The performance of query execution based on the stage at which facts and rules become available is presented in Section VI-C. The optimization can also be performed at query compile-time, where the rules and the cardinalities of the initial extensional database are available. If facts are only provided at runtime, then this happens at the start of query execution.
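The greedy reordering procedure described above can be sketched as follows, assuming a toy JoinAtom type in place of Carac's internal atom representation:

```scala
import scala.collection.mutable.ListBuffer

// Illustrative stand-in for Carac's internal atom representation.
case class JoinAtom(relation: String, vars: Set[String], cardinality: Long)

// Number of shared variables: the selectivity proxy described above.
def sharedVars(a: JoinAtom, b: JoinAtom): Int = a.vars.intersect(b.vars).size

// Greedy reordering: pop the smallest-cardinality atom as the leftmost
// relation, then repeatedly chain the most-connected (then smallest)
// neighbour of the most recently placed atom; when no remaining atom
// shares a variable, fall back to the next-smallest atom on the stack.
def reorder(atoms: List[JoinAtom]): List[JoinAtom] = {
  var stack = atoms.sortBy(_.cardinality)
  val order = ListBuffer.empty[JoinAtom]
  while (stack.nonEmpty) {
    var current = stack.head
    stack = stack.tail
    order += current
    var next = stack
      .filter(sharedVars(current, _) > 0)
      .sortBy(a => (-sharedVars(current, a), a.cardinality))
      .headOption
    while (next.isDefined) {
      current = next.get
      stack = stack.filterNot(_ eq current)
      order += current
      next = stack
        .filter(sharedVars(current, _) > 0)
        .sortBy(a => (-sharedVars(current, a), a.cardinality))
        .headOption
    }
  }
  order.toList
}

val ordered = reorder(List(
  JoinAtom("edge1", Set("x", "y"), 1000L),
  JoinAtom("path", Set("y", "z"), 10L),
  JoinAtom("edge2", Set("z", "w"), 500L)
))
```

On this example the small path relation is placed first, then the smaller of its two equally connected neighbours, mirroring the tie-breaking on cardinality described above.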
Further, since Adaptive Metaprogramming enables on-the-fly recompilation of subqueries, this optimization can be repeatedly applied over the course of query evaluation to adapt to changes in the relation cardinalities. The sorting algorithm used is Timsort [22], so re-sorting an already-sorted or nearly-sorted list is cheaper than sorting from scratch every time. Thus continuous re-sorting, even if the sort order does not change significantly, does not introduce significant overhead. The optimization is applied based on the state of the database at the point in time that Carac’s JIT switches from interpretation to compilation, discussed in detail in Section V and evaluated in Section VI-B.

IV. RUNTIME JOIN-ORDER OPTIMIZATION

The order of sub-clauses, called atoms, within a Datalog rule definition does not affect Datalog semantics. Thus rules may be rewritten with different orders of atoms without changing the result of the program. The order of atoms can, however, have a significant effect on the performance of the program. Because of this, we chose to target rearranging the order of atoms within a rule to illustrate the optimization potential of Adaptive Metaprogramming. In this section, we explain why reordering atoms can lead to performance gains and present the optimization algorithm applied.

The Carac intermediate representation contains a series of unions of conjunctive queries. These sub-queries are generated from the rule definitions, which determine the set of relations and the select, project, and join keys. For a rule path with a single definition of the form:

  path(x, y) :- path(x, z), edge(z, y), edge(y, “a”)

the generated sub-query will take the form of

    π$0,$1(σ$5=a(edge∆ ⋈$1=$2 path⋆ ⋈$2=$4 edge⋆))
  ∪ π$0,$1(σ$5=a(edge⋆ ⋈$1=$2 path∆ ⋈$2=$4 edge⋆))
  ∪ π$0,$1(σ$5=a(edge⋆ ⋈$1=$2 path⋆ ⋈$2=$4 edge∆))

where ∆ marks “delta” relations and ⋆ marks “derived” relations. Delta relations store facts discovered during the current iteration and may grow or shrink unpredictably between iterations, while derived relations store all facts discovered over the course of one stratum (or the entire program, if unstratified) and grow monotonically.

If there are multiple definitions for path, the result will be the union of the above expression for each rule definition. As can be seen in the above expression, for a rule definition with n atoms in the body that each share at least one variable with another atom, evaluation requires a union of n queries, each containing an n-way join of n − 1 derived relations and 1 delta relation. There is a factorial number of possible left-deep join orders for each sub-query, and it is well known that the order in which join operations are executed can profoundly affect execution time [21]. When doing n-way joins, intermediate relations can blow up in size significantly, only to be reduced again in the final join. A more performant execution plan avoids this cardinality blow-up by carefully selecting the order of atoms so that the sub-joins produce intermediate relations with smaller cardinalities.

0 | val program = new Program(Execution(Storage()))
1 | val edge = program.relation[Constant]("edge")
2 | val path = program.relation[Constant]("path")
3 | val x, y, z = program.variable()
4 | path(x, y) :- edge(x, y)
5 | path(x, z) :- (edge(x, y), path(y, z))
6 | edge("a", "a") :- ()
7 | edge("a", "b") :- ()
8 | path.solve() // Set((a, a), (a, b))

Fig. 2. Sample Carac input program

V. CARAC: SYSTEM DESIGN

In this section we provide an overview of the system architecture of Carac and show how Carac progressively lowers the input Datalog query into an executable.
Section V-A describes the user-facing embedded DSL, Section V-B describes the generation and optimization of the logical query plan, Section V-C discusses generation of the physical plan for the Datalog-specific operators, and finally Section V-D describes relation storage and generation of the physical plan for the relational operators.

We chose to write Carac in Scala because it provides Principled Metaprogramming [13] combined with compile-time macros [23], higher-order functions, and direct generation of JVM bytecode at runtime. Multi-Stage Programming language constructs allow us to generate code at compile-time, at the start of a Datalog query, or repeatedly during the execution of a Datalog query. The result of staging an expression is a transformed, often partially evaluated representation of the input program that can be stored and run later. In Carac, this can happen at different levels: the internal intermediate representation, stored higher-order functions, quotes containing typed ASTs of Scala code, or directly generated JVM bytecode.

As rules are defined, the metadata generated for each rule also drives the generation of a precedence graph, so that relations that rely on other relations are calculated only after their dependencies.

B. Execution Engine

In this section, the main challenges that inform the design of Carac are discussed, including the generation and optimization of the logical query plan and how Carac’s JIT switches from interpreting the logical query plan to generating code for the physical query plan.

1) Logical Query Plan Generation

Carac uses a Futamura Projection [24] to generate an imperative program specialized to the input Datalog program. In the first stage after execution begins, a series of static rewrite rules is applied to the AST to perform optimizations like alias elimination.
After the AST transformation phase, the AST is visited to transform the declarative Datalog constraints into an imperative intermediate representation of the Datalog program, similar to Soufflé’s Relational Algebra Machine [16]. The Semi-Naive evaluation strategy [5] is used to partially evaluate the Datalog program into a series of operations called IROps, including unions of conjunctive queries, read/write/update operations, and a DoWhile operation. While the generated IROp program is imperative, we consider it the logical query plan for both the Datalog-specific operators and the traditional relational algebra operators. When Carac is in interpretation mode, there is no further partial evaluation and the interpreter visits this IROp tree. All IROp programs have a similar core structure, shown in Figure 3. Carac’s IROps are represented with generalized algebraic data types (GADTs) [25], so they can be either interpreted or compiled into a function that returns the correct type. Static optimizations like predicate pushdown and foreach-relation loop unrolling are also performed in this phase.

A. Carac DSL

Carac supports a deep embedding of a Datalog-based DSL, extended with stratified negation and aggregation, into the Scala programming language. Similarly to Flix [17], Carac supports first-class Datalog constraints for interoperability between general-purpose programs and the declarative Datalog programs. A simple transitive-closure Datalog program expressed in Carac can be seen in Figure 2. On line 0, the program environment is constructed from two components: (1) an execution engine instance, responsible for the Datalog evaluation strategy, and (2) a relational storage and execution layer, responsible for the management of the generated relations and execution of relational subqueries. Two facts for the relation edge are defined on lines 6 and 7, and two rules for path are defined on lines 4 and 5.
On the final line 8, we call solve on our rule path to begin execution of the query and obtain all the facts that can be generated from the input facts using the rule path and any other rule or fact relations that it depends on. Carac facts and rules can be defined at compile-time or incrementally added at runtime. Each fact is added to the corresponding relation in the relational layer as it is defined. As rules are defined, they are added to the program AST, and metadata is generated for each rule. The metadata includes the locations of variables and constants in the rule head and body, which will later be used to optimize execution.

2) Compilation granularity

In JIT mode, Carac begins interpretation at the root node of the tree and can switch to compilation at any node in the IROp program tree, illustrated in Figure 3. The node at which compilation begins is referred to as the compilation “granularity”. Nodes that appear higher in the tree have fewer instances, so compiling at a higher granularity will incur fewer compilations, but the code being compiled will be larger.

3) Safe Points

“Safe points” refer to locations within the execution of a program where an optimizing JIT compiler can safely switch between interpretation and compilation. When the Carac JIT reaches a node to compile, it can compile that node in “full” or as a “snippet”. When compiling in full, the node itself and its entire subtree are compiled into a single function. The interpreter will continue to visit nodes until it reaches the node of the correct granularity and then will switch entirely to the compiled code, only switching back once the entire subtree has been evaluated. The next time the interpreter reaches that node, it can use the existing compiled code, recompile it into a new function, or revert back to interpretation.
Alternatively, Carac can compile a “snippet” when using the quotes compilation target, which compiles only the body of the targeted node, not its children; control flow is then deferred back to the interpreter from the compiled code by splicing continuations into the generated code. Compiling only snippets enables continuous checks for novel optimizations or deoptimization, which extends the staging into however many levels are needed. Additionally, snippet compilation allows the absolute minimum amount of code to be generated, mitigating some overhead costs of compilation.

All program state is either returned from functions in the form of generated facts or stored in the relational storage layer. This property is why any IROp node can be considered a “safe point” at which to switch evaluation modes in Carac, and why the stack does not have to be saved or copied.

Fig. 3. Datalog IROp Program Structure: a Program runs EvalN for each rule, then a DoWhile loop around EvalSN; each Eval is a Union over the rule’s definitions (Union::Rule), each definition d a Union over its atoms (Union::Body), and each atom a of d a σπ⋈ operation joining the remaining atoms {A − a} with the delta relation ∆a.

Moreover, the information available to make optimization decisions may be outdated compared to compiling at a lower granularity. Therefore at the highest granularity, Seq::EvalSN, the optimization will be applied once for every iteration of the Semi-Naive algorithm. The cardinalities used will be from the Derived database, i.e. the state of the relations at the start of the outer program loop. At a lower granularity, like σπ⋈, the optimization will occur before every n-way join. The relation cardinalities at the lower granularity will be the most accurate, as they take into account the exact cardinality of the delta relations, but compilation will happen more frequently. To avoid unnecessary compilations, Carac applies a “freshness” test before recompiling on higher-overhead compilation targets like quotes, checking that the cardinalities have changed, relative to other relations, above a tunable threshold.
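A freshness test of this kind might look like the following sketch, where the threshold compares each relation's share of the total cardinality. The function shape is an assumption for illustration, not Carac's API:

```scala
// Hypothetical freshness test: recompile only if some relation's share of
// the total cardinality has shifted by more than `threshold` since the
// last compilation.
def stale(
    atCompile: Map[String, Long], // cardinalities when code was generated
    now: Map[String, Long],       // current cardinalities
    threshold: Double             // tunable, e.g. 0.2 = 20 percentage points
): Boolean = {
  def shares(m: Map[String, Long]): Map[String, Double] = {
    val total = m.values.sum.toDouble.max(1.0)
    m.view.mapValues(_ / total).toMap
  }
  val before = shares(atCompile)
  val after = shares(now)
  (before.keySet ++ after.keySet).exists { r =>
    math.abs(after.getOrElse(r, 0.0) - before.getOrElse(r, 0.0)) > threshold
  }
}

// Proportional growth leaves relative cardinalities unchanged: not stale.
val unchanged = stale(Map("edge" -> 100L, "path" -> 100L),
                      Map("edge" -> 500L, "path" -> 500L), 0.2)
// One relation outgrowing the others crosses the threshold: stale.
val changed = stale(Map("edge" -> 100L, "path" -> 100L),
                    Map("edge" -> 100L, "path" -> 900L), 0.2)
```

Using relative shares rather than absolute counts matches the idea that cardinalities which all grow proportionally need not trigger a new join order.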
For the specific case of atom reordering, it is possible to push the optimization entirely to runtime with the IRGenerator compilation target described in Section V-C. However, this limits the ability of Carac to combine it with other lower-level optimizations enabled by code generation, for example just-in-time data structure specialization.

C. Compilation Targets

Carac supports four different targets for compiling IROps into the physical query plan. The targets range from highly expressive and flexible, allowing for arbitrary runtime code generation and continuous re-optimization, to performant but limited or unsafe code generation.

1) Multi-Stage Programming Quotes & Splices

The most powerful, expressive, and safe mechanism for Carac to generate code at runtime is to compile IROps into the Scala compiler’s Quote and Splice constructs, which implement classic Multi-Stage Programming concepts [13]. Quotes delay execution to a later stage and can be thought of as code generation, while splices trigger the execution of a quote. Quotes can contain other quote instances, enabling multiple stages of code generation. Nodes in the IROp program tree are compiled into quotes containing a single function definition at the outermost level, so that when it is time for the code to be executed, it can be called exactly like any other function. Once an operation is compiled, calling the function has no more overhead than an ahead-of-time compiled function. Quotes allow for de-optimizing if necessary, reverting back to interpretation, or further regenerating code. Because quotes are type-checked by the Scala compiler at Carac’s compile-time, they are both safe and the most powerful mechanism for code generation.

Once Carac has begun compilation, it can either block waiting for the generated code to be callable, or compile asynchronously and continue interpretation, switching to the compiled code only when it is ready.
Asynchronous compilation is intended for higher-granularity operations: for example, a union operation of n subqueries will begin compilation, then start interpreting the first sub-query. After each subquery is executed, the interpreter checks if the compilation is ready; if so, it calls into the compiled code such that it begins at the exact spot the interpreter left off. The goal is for Carac in JIT mode not to perform substantially worse than pure interpretation: even if compilation takes unexpectedly long, the interpreter can continue making progress evaluating the Datalog program while compilation happens on a different thread. However, this means that if compilation takes longer than the time to finish interpreting the sub-tree, the generated code may never be used.

Quotes allow for unlimited functionality within the generated code. However, the safety and expressiveness come at a price: compiling quotes into bytecode requires invoking the Scala compiler at runtime, which can incur significant overhead. Figure 4 compares quote generation time depending on whether the compiler is “warm”, i.e. compilation has happened at least 50 times before measurement, or “cold”, meaning the compiler is instantiated immediately before compilation. The legend displays the granularity (illustrated in Figure 3) at which compilation happens. The left-hand side shows “Full”, meaning compilation of the entire subtree, and the right-hand side shows “Snippet”, which compiles only the operator and splices in continuations to revert to the interpreter after the operator completes execution.

The safety properties are especially important when considering Datalog language extensions that allow for arbitrary code execution, similar to a UDF in a traditional data management system. The quotes-and-splices backend enables safer expressivity of the input Datalog program because it prevents unsafe user-supplied code from executing within Carac.
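The asynchronous hand-off described at the start of this subsection can be sketched with a Future, assuming interpreted and compiled paths that agree on results. The types are toy stand-ins, not Carac's code:

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch of the asynchronous hand-off: compilation runs on another thread
// while the interpreter keeps evaluating subqueries; once the compiled
// form is ready, it is picked up at the exact subquery where
// interpretation left off.
def evalUnion(
    subqueries: List[Int],          // stand-ins for subquery descriptions
    interpret: Int => Set[Int],     // interpreted (slow) path
    compile: () => Int => Set[Int]  // produces the compiled (fast) path
): Set[Int] = {
  val compiled = Future(compile())  // compilation starts asynchronously
  var acc = Set.empty[Int]
  var rest = subqueries
  while (rest.nonEmpty) {
    // Non-blocking readiness check: fall back to interpretation while the
    // compiled code is not yet available (or if compilation failed).
    val step = compiled.value.flatMap(_.toOption).getOrElse(interpret)
    acc = acc ++ step(rest.head)
    rest = rest.tail
  }
  acc
}

val result = evalUnion(
  List(1, 2, 3),
  i => Set(i, i + 10),
  () => i => Set(i, i + 10)
)
```

Because both paths compute the same facts, the union's result is identical whichever subquery the hand-off happens at, which is what makes any subquery boundary a safe switching point.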
Fig. 4. Execution time of code generation: time for 50 compilations with a cold versus warmed-up (50x) compiler, in “Full” and “Snippet with continuation” modes, at each granularity (Program, DoWhile, EvalN, EvalSN, Rule, Body, σπ⋈).

4) IROp-Generation

Lastly, Carac can regenerate IROps on the fly and invoke the interpreter to execute the new IROps when available; this target is called “IRGenerator”. IROps are the most limited target, as they only allow rearranging operations within the existing IROp program structure. For the optimization described in Section IV this suffices, but it would not scale to optimizations applied at a lower level of abstraction than the IROps. While the program is still staged, as the IROps are a partially evaluated representation of the input Datalog program, the IRGenerator target does not need to invoke a separate compilation phase, so the overhead is limited to the cost of sorting and rearranging the subqueries at runtime.

The tradeoffs between the flexibility and performance of the different compilation targets are evaluated in Section VI. The choice of backend is left as a user parameter because Carac cannot anticipate whether a user wishes to prioritize performance over safety (and would therefore use the bytecode target), expressivity of the Datalog variant over performance (quotes and splices), or safety over expressivity (lambda), or is otherwise limited by the execution environment: for example, with Graal Native-Image one cannot use runtime classloading and could therefore only use the lambda and IROp-generation targets, while running on the JVM enables all backends.

2) Bytecode

Alternatively, Carac supports direct generation of JVM bytecode from IROps, using the experimental JEP Class-File API Preview [26]. The bytecode generator can be invoked at runtime but, unlike Quotes, it does not generate code that can revert to interpretation once compiled.
However, if the interpreter reaches a node that has been compiled into bytecode, it can choose to throw out the previously generated bytecode and regenerate it. Direct-to-bytecode generation is, in theory, as expressive as quote generation, as it also allows for arbitrary code execution. However, manually generating bytecode instructions is error-prone and carries a much higher development cost than the Quotes API, which only requires writing Scala code as one would for regular, non-staged code. Further, because the generated bytecode is interpreted directly by the JVM, there are few safety guarantees. Bypassing the safety checks of invoking a full compiler leads to lower overheads, but at the cost of losing the ergonomics and safety of writing in a type-safe programming language.

D. Physical Relational Operator Layer

Carac's execution layer is built on top of a pluggable "relational layer" to store input and intermediate relations and to plan and execute basic relational algebra operations. Each rule has two types of intermediate relations: a Derived database and a Delta database. The Derived database stores all the facts discovered so far throughout a Datalog program evaluation. The Delta database stores only the new facts derived in the previous iteration of the main program loop. The two databases are further split into Known and New databases to isolate the reading of previously derived facts within a subquery from the writing of newly derived facts in the same query. This split between read-only and write-only relations, which are swapped at the end of each iteration, is what enables flexibility in Carac's safe points.

3) Lambda Carac supports a lambda-based backend that, at runtime, stitches together higher-order functions that were precompiled at Carac's compile-time. The Lambda backend is a more limited compilation target, as Carac can only choose between predefined procedures.
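The idea behind such a lambda backend, composing a fixed set of precompiled operator procedures into one nested closure instead of generating fresh code, can be sketched as follows (illustrative Python with hypothetical operator names, not Carac's Scala code):

```python
# Predefined, "precompiled" operator procedures: each returns a closure.
def scan(rel):
    return lambda db: list(db[rel])

def select(child, pred):
    return lambda db: [t for t in child(db) if pred(t)]

def union(children):
    return lambda db: [t for c in children for t in c(db)]

def stitch(plan):
    """Recursively stitch a plan description into one nested closure,
    so evaluation no longer traverses an operator tree at runtime."""
    op = plan[0]
    if op == "scan":
        return scan(plan[1])
    if op == "select":
        return select(stitch(plan[1]), plan[2])
    if op == "union":
        return union([stitch(p) for p in plan[1:]])
    raise ValueError(op)
```

On the JVM, the analogous stitched closures are small and monomorphic enough that the runtime can inline them aggressively, which is where the Lambda backend's performance comes from.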
However, the cost of invoking the Scala compiler at runtime can be avoided while maintaining the safety guarantees lacking from bytecode generation. The performance of the generated lambdas benefits from the JVM's ability to aggressively inline these smaller functions, avoiding the overhead of function calls.

The relational layer API provides read and write access to the relations, as well as clear, swap, and diff operations. swap and clear efficiently implement copying all newly derived facts into the existing derived facts at the end of every loop iteration, and diff determines whether the loop should terminate. Finally, the relational layer API provides the relational operators that make up the subqueries: select, project, join, and union. Typical relational algebra optimization and compilation can be performed at this level. So far, Carac has been integrated with a typical push-based engine and a pull-based engine that store relations in in-memory Scala collections.

Fig. 5. Async/Blocking, compared to unoptimized (program analysis queries). (Bars: Hand-optimized; JIT IRGenerator; JIT Quotes, Bytecode, and Lambda, each async and blocking; y-axis: speedup over "unoptimized".)

VI. EVALUATION

The input Scala program, which we named "SListLib", is a minimized representation of a typical off-the-shelf utility within an analytical pipeline that does a bit of "wasted" work that can be optimized away (in this case, extra serialization). Carac automatically derives facts from our input Scala program by querying over the TASTy files using TASTy Query [28]. The input Scala program is around 200 lines of Scala code. For the input Carac query, we then extend a points-to analysis query with rules that define when values can be considered equivalent given adjacent functions that can revert data transformations.
We then add the knowledge of which functions can "undo" each other given the correct arguments with a fact containing the names of the list library's serialization functions:

InverseFns("deserialize", "serialize")

To evaluate Carac with a range of query structures, we construct a benchmark set that includes queries representative of program analyses as well as more general recursive queries. In Section VI-A, we discuss the details of these queries and why they were chosen. In Section VI-B, we evaluate Carac's runtime-optimized query performance compared to interpreted queries. In Section VI-C, we evaluate Carac's compile-time-optimized query performance compared to runtime-optimized, as well as the combination of compile-time with runtime optimization. In Section VI-D, we compare Carac's runtime with a current state-of-the-art Datalog execution engine, Soufflé.

Experiments are run on an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (2 x 12-core) with 395GB RAM, on Ubuntu 18.04.5 LTS with Linux kernel 4.15.0-151-generic and Scala 3.3.1-RC4. The JVM used is OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing) and OpenJDK Runtime Environment (build 17.0.2+8-86) with a default heap size of 36GB. Each experiment is run with the Java Microbenchmark Harness (JMH) [27] version 1.32 with 10 iterations, 10 warm-up iterations, 10 calls per iteration ("batch size"), and a minimum execution time of 10 seconds per benchmark unless otherwise noted.

We evaluate Carac with our inverse function analysis and a typical "points-to" analysis based on Andersen's (context-free and flow-insensitive) points-to analysis [29] on the facts generated from the SListLib program. We also include a few general, shorter-running Datalog programs: an Ackermann function [30], a small constraint-based analysis example program ("CBA"), a Fibonacci program, and an equality and prime-numbers program.
The purpose of these smaller programs is to illustrate how the complexity of the Carac query affects the ability of the runtime optimizations to speed up the query. For runtime optimizations, the longer the program runs, the more time there is to calculate and apply optimizations without introducing disproportionate overhead. Smaller, faster-running programs are much less forgiving than longer, more complex program analysis queries. These smaller queries identify the point at which queries become so short-running that the online optimization no longer shows meaningful speedup.

A. Carac Query Benchmarks

To select representative queries for our evaluation, we first consider a custom Datalog-based static analysis query that identifies potential "wasted" work in the form of adjacent inverse functions. Note that most optimizing C compilers will not identify and reverse inverse operations. For example, when analyzing a C program, we can add the fact inverse(itoa, atoi) for integer-to-string conversions (but not the reverse, since atoi is not a total function on the domain of strings). Where possible, calls to entire symmetric serialization libraries can be declared as inverses, e.g., inverse(to_json, from_json). For the input data to this query, we select an input Scala program to analyze that implements and uses a custom linked list library in Scala supporting a serialize and a deserialize function. The entry point to our input Scala program instantiates the custom linked list, operates over the list's contents, serializes the list, does some computation, then deserializes the list and returns the result.

B. Online Optimization Speedup

To evaluate the effectiveness of Carac's runtime optimization and staging mechanisms, we consider staged queries, executed with the online optimization described in Section IV, compared with interpreted "baseline" queries.
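The online optimization being evaluated reorders the atoms of each subquery by the current cardinalities of the relations they read, re-applying the sort as delta relations grow and shrink across iterations. A minimal sketch of that idea (illustrative Python; the representation is hypothetical, not Carac's IROps):

```python
def reorder_atoms(atoms, cardinality):
    """Order the join inputs of a rule body smallest-first by live cardinality."""
    return sorted(atoms, key=cardinality)

def evaluate_iteration(rule_atoms, sizes, join):
    """One fixed-point iteration: re-sort atoms by their current (delta)
    sizes, then hand the ordered list to the join implementation."""
    ordered = reorder_atoms(rule_atoms, lambda a: sizes[a])
    return join(ordered)
```

Because the sort runs before every iteration, a relation that balloons mid-execution is pushed later in the join order without any profiling stage.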
Because there is no "typical" way to order Datalog atoms, we consider two formulations of our input Carac queries, representing the best and worst cases. One formulation has been hand-optimized and rewritten manually by carefully stepping through execution and tracking intermediate relation cardinalities. Hand-optimizing Datalog programs is a tedious process requiring advanced knowledge of Datalog evaluation algorithms and the internal execution engine implementation details. For non-experts, finding the correct join order for complex rules has been called an "insurmountable" challenge [31]. The other formulation of the Carac queries is written to be inefficient, to simulate what happens if a naive user were to have bad luck in their order of atoms. There is no obvious way for a user to tell the difference between a good and a bad ordering of atoms, so we compare the most and least efficient formulations we could identify to illustrate the speedup bounds of the optimization. As it is also nontrivial for Carac to determine ahead of time whether a query is optimized, we use both types of input query to evaluate Carac's ability to optimize programs that have room to improve without overly penalizing programs whose performance is already close to optimal.

In all experiments, Carac executes single-threaded, with only compilation occurring on a separate thread. When compiling on a separate thread, Carac can either block ("blocking") or continue interpretation until the compiled code is ready ("async"). The storage engine is a push-based execution engine that stores relations in-memory in Scala Collections. The choice of staging mechanism within Carac is referred to as the "backend".
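To see why a good versus bad atom ordering matters so much, consider a back-of-the-envelope estimate of the intermediate result sizes a left-to-right join pipeline materializes (a toy cost sketch with made-up cardinalities and a uniform selectivity, not Carac's optimizer or one of the benchmark queries):

```python
def intermediate_sizes(cardinalities, selectivity):
    """Estimated size of each intermediate result when joining atoms
    left to right, assuming a uniform per-join selectivity."""
    sizes = [cardinalities[0]]
    for c in cardinalities[1:]:
        sizes.append(round(sizes[-1] * c * selectivity))
    return sizes
```

For three atoms of sizes 10000, 1000, and 10 with selectivity 0.001, the ordering largest-first yields intermediates [10000, 10000, 100], while smallest-first yields [10, 10, 100]: the final result is identical, but the work done along the way differs by orders of magnitude, which is exactly the gap between the two formulations.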
Because of the wide range of execution times between queries, the figures in Section VI-B report "speedup" (baseline execution time divided by optimized execution time) to normalize the results, but the full execution times for the baseline interpreted queries are reported in Table I.

Fig. 6. Blocking, compared to hand-optimized (speedup over "optimized" for the JIT IRGenerator, Quotes, Bytecode, and Lambda backends across the benchmarks)

Fig. 7. Async, compared to hand-optimized

Fig. 8. Blocking, compared to unoptimized (general queries)

Fig. 9. Async, compared to unoptimized (general queries)

TABLE I
EXECUTION TIME (S) OF INTERPRETED CARAC QUERIES

Query              | Unoptimized   | Optimized
Ackermann          | 7.001168      | 0.272208
CBA                | 0.038313      | 0.008064
Equal              | 0.092675      | 0.019881
Fibonacci          | 0.336177      | 0.102191
Primes             | 0.103381      | 0.015402
Points-To          | 7229.473415   | 12.246468
Inverse Functions  | 16980.17118   | 7.504203

1) JIT-optimizing "unoptimized" programs We first consider how close Carac can bring the performance of programs that are not already optimized to the performance of hand-optimized programs. In Figure 5, the speedup of the JIT-optimized Carac program over the "unoptimized", interpreted input program is shown for both the traditional points-to analysis and our inverse function analysis, both applied to facts generated from the SListLib input program. The "hand-optimized" program runs about 2263x faster than the "unoptimized" program for the inverse function analysis. In the best case, Carac's JIT optimizer produces a program with a speedup of around 2424x over the "unoptimized" input program, with no input from the user. The IRGenerator backend can also outperform the "hand-optimized" programs, but less consistently than the Lambda backend. The reason for this improved performance is that the JIT applies the optimization repeatedly, while the hand-optimized program uses the same join order throughout the execution.

For short-running compilations, like those of the Lambda backend, blocking compilation shows a 2424x speedup compared to the async speedup of only 1700x, as no iteration is allowed to proceed without the optimization applied and the JVM has more opportunities to optimize the generated lambdas. For shorter-running programs that operate over smaller amounts of data, Carac's JIT optimizer shows speedups closer to the hand-optimized potential for both asynchronous and blocking compilation, as seen in Figures 8 and 9.

a) Compilation Targets. We compare the different backends to show the tradeoff between expressibility, maintainability, and performance. The Quotes backend is the most expressive and the easiest to maintain, as the generated code is fully typed at compile-time by the Scala compiler. It also allows for finer-tuned application of optimizations and arbitrary runtime code generation. However, the overhead of invoking the compiler and waiting on the result can limit the speedup: for example, for the Inverse Functions query with synchronous compilation, the Quotes speedup is only 1538x, compared to the "potential" speedup of 2260x. At the other end of the spectrum, the Bytecode backend more closely resembles traditional LLVM-IR-generating RDBMSs and requires unsound generation of bytecode.
The bytecode compilation target exchanges maintainability and correctness checks for performance: it performs comparably to the Quotes backend on some programs, like the inverse function analysis, and better on others, like the Ackermann function. The Lambda backend relies on the JVM's ability to aggressively inline functions that are called repeatedly. While the Quotes and Bytecode backends improve with more iterations as the JVM warms up, they do not show the same sharp increase in performance with repeated calls as the Lambda backend does. However, the Lambda backend can only call the set of predefined lambdas, so it cannot arbitrarily generate code at runtime the way the Quotes and Bytecode backends can. The Lambda backend can even outperform the interpreted "hand-optimized" program in some experiments due, in part, to the lack of tree-traversal overhead. Lastly, the IRGenerator regenerates the internal IR on the fly, which can then be passed to the interpreter. As only IROps are generated, the "compilation" phase is minimal compared to the other backends, so the overhead of applying the optimization is minimal.

b) Asynchronous Compilation.

2) JIT-optimizing "hand-optimized" programs We now consider how much Carac can unintentionally degrade the performance of programs that are already hand-optimized. In Figures 6 and 7, we show the speedup (or slowdown) of applying Carac's JIT optimizations to already hand-optimized input Datalog programs. At worst, for very short-running programs where the cost of applying any optimization is a large proportion of the total runtime, Carac runs end-to-end at 0.3x the speed of the "hand-optimized" program run on the interpreter. For the remaining programs, the average speed when async is 0.8x that of the optimized runtime, while blocking does not show any performance degradation and is 1.3x faster due to continually applying optimizations.

a) Compilation Targets.
In contrast to the results presented in Section VI-B1, for already-optimized programs, the lower the compilation overhead of a backend, generally the worse the Carac JIT compiler performs: low overhead enables Carac to more quickly apply an optimization that in fact produces a slightly slower program than the input program. For example, for the Points-To analysis on SListLib, the Lambda backend runs at 0.3x the speed of the "hand-optimized" runtime. The Quotes backend, which has a higher compilation cost, runs slightly better, at 0.5x the speed of the "hand-optimized" program. How close the optimization can get to the "hand-optimized" program, or whether it can be exceeded, is a property of the program itself. For example, the Ackermann function benchmark does not experience any performance degradation when Carac applies JIT optimization to already-optimized programs using the Lambda backend. In fact, the JIT-optimized program is slightly faster than the interpreted hand-optimized program because the difference between the hand-optimized and JIT-optimized join orders is so small that the benefits of compiling and avoiding tree traversal outweigh the speedup of hand-tuning the input program. Ultimately, hand-tuned programs have consistently shown better performance than compiler-optimized programs, but hand-tuning requires intense effort and deep knowledge of Datalog and Carac-specific implementation details, and is challenging to scale to large input programs.

Because Carac's "safe points" occur between IROps, for non-blocking compilation the overhead of invoking the compiler can delay the application of the optimization, so the Carac program may run unoptimized for a few iterations before the compiled code is ready. In Figure 5, for the bars marked "async", Carac does not wait for compilation to finish and instead continues interpreting the unoptimized program until the compiled, optimized program is ready.
For the bars marked "blocking", Carac blocks evaluation until the optimized code is ready, so no unoptimized Datalog is ever run, but the overhead of invoking the compiler is unavoidable. For longer-running compilations, like when generating Quotes, blocking to wait for compilation can impact the overall speedup significantly: for example, for the inverse function benchmark, Carac goes from a 1674x speedup when asynchronously compiling to a 1538x speedup when blocking.

C. Ahead-of-time Compilation

Carac can compile query plans ahead-of-time, when Carac itself is compiled, reordering the join operations using the set of query facts and rules available at that point. This Carac-compile-time, or "offline", optimization (so called since the cost of compilation is not included in the query execution time) is implemented with macros that use the same backend as Carac's JIT Quote compiler [32]. This offline sort optimization can be combined with continuous online join reordering by ahead-of-time compiling the IRGenerator backend. Because the execution engine uses Timsort [22] to reorder relations, sorting already-sorted inputs is done in linear time, and sorting inputs with sorted subsets performs better than sorting completely unsorted inputs. Thus, even if the offline ordering is not exactly right but close, presorting offline contributes to quicker online sorting; this is why compile-time sorting is worth doing, even when facts are not fully available, even if online sorting is enabled.

In Figure 10, we compare the speedup of query execution time over the "unoptimized" interpreted query (Table I) in the following configurations: "JIT-lambda", Carac's JIT with the lambda backend at the lowest granularity (the same configuration as in Figure 8), which does not have access to any facts or rules before execution begins; and "Macro Facts+rules (online)", which has access to all facts and rules at Carac compile-time and so applies the sorting optimization using the input relational cardinalities.
The generated code includes the online optimization using the IRGenerator. "Macro Rules (online)" works similarly, except it only has access to rules, and no fact relation cardinalities, at compile-time. As a result, the reordering of relations depends only on the selectivity of the rules and not on the relation cardinalities. This configuration also injects the online sort into the code generated at compile-time. "Macro Facts+rules" does the compile-time reordering using all relations but does not inject the online optimization. Finally, "Macro Rules" does the compile-time reordering using only the rules and does not generate code with the online optimization applied. In Figure 10, experiments that include continuous relation reordering after query execution begins are shown with a double-line border. Orthogonally, experiments that access no fact or rule information at Carac compile-time are shown with a solid color, those that access rules only are spotted, and those that access rules and facts are striped.

Generally, online reordering optimizations perform better than offline optimizations because the order of relations fed to the join can be continuously updated. However, continuously reordering relations has some overhead, and for programs whose relation cardinalities do not change dramatically relative to each other, a single join order over the course of query execution may perform better than online reordering by avoiding that overhead (illustrated by the Primes benchmark). The JIT-lambda configuration performs comparably to, or sometimes better than, the macros-plus-online-IRGenerator optimization because the lambda-generated code avoids the tree-traversal overhead present in the macro configurations, as the macros are essentially compiling the interpreter.
The macro experiments that gain access to fact relation cardinalities at compile-time tend to perform better than those that rely only on rule selectivity, due to the cost of ingesting facts at runtime and a more optimal join order resulting from knowledge of the relation cardinalities.

Fig. 10. Ahead-of-time and Online Compilation (speedup over "unoptimized" for JIT lambda, Macro Facts+rules (online), Macro Rules (online), Macro Facts+rules, and Macro Rules on the Ackermann, CBA, Equal, Fib, and Primes benchmarks)

However, this is not always the case: for example, in the Fibonacci benchmark, the rules-only macro slightly outperforms Macro Facts+rules, a result of the join-ordering algorithm privileging relation cardinality over selectivity where, in reality, the selectivity is more influential on the overall execution time. In Figures 11 and 12, the speedup of the interpreted hand-optimized programs is compared with the speedup of compile-time-only optimizations ("Macro Rules" and "Macro Facts+rules", in the same configuration as the corresponding bars in Figure 10). While not as effective as runtime-optimized programs, compile-time-optimized programs show at best a 1100x speedup for the Inverse Functions query, and can even slightly outperform the hand-optimized programs for smaller programs like Equal, Fibonacci, and Primes, in part due to the use of macros. To conclude, for most but not all benchmarks, the earlier information is known, the better the optimizations that can be applied, and combining compile-time and runtime optimizations performs better than either individually. Whether these conclusions hold for a specific benchmark is a property of the query itself and of how relation cardinalities change over the course of query execution. These properties are difficult to predict statically, especially without access to the full fact and rule relations, so queries with varying properties are presented in this subsection.
However, regardless of the query, all presented configurations outperform the baseline, unoptimized queries: by over 1100x with facts and rules, and by over 480x with only rules.

D. Comparison with State-of-the-art

To get a sense of how Carac compares against existing Datalog engines, in this section we compare the execution time of equivalent Carac queries with those of a state-of-the-art Datalog execution engine. The current state-of-the-art for Datalog evaluation is Soufflé [16], which ingests an extension of Datalog and transforms the input program through a series of partial evaluations, generating template-heavy C++ programs that are ultimately compiled into a binary executable. Soufflé and Carac both generate an imperative program specialized to the input Datalog program. Carac supports a Datalog-based DSL embedded in the Scala language, which allows for interoperability with Scala itself. In contrast, the Soufflé language is not an embedded DSL but its own Datalog-based language with extensions, including aggregation, user-defined functors, macros, and others. Soufflé provides auto-tuning capabilities via a profiling phase [31] that identifies optimal join orders.

Figure 13 compares the execution time of various programs expressed in Soufflé and Carac, on Soufflé version 2.4 and Carac in "full" mode, run synchronously at a granularity that takes into account delta relations (σπ⋈). In all experiments, both Soufflé and Carac execute in a single-threaded context. Soufflé is invoked without any special arguments beyond silencing warnings, setting threads to 1, and emitting/using statistics for the experiments that include a profiling phase. All experiments include reading in facts and writing result facts to a tab-separated file.
To facilitate generating a Graal Native Image for Carac, the experiments in Figure 13 use Java HotSpot(TM) 64-Bit Server VM Oracle GraalVM 17.0.8+9.1 (build 17.0.8+9-LTS-jvmci-23.0-b14, mixed mode, sharing) on Ubuntu 22.04 LTS with Linux kernel 5.15.0-27-generic. We compare Soufflé with auto-tuning, both when interpreting and compiling: first with the profiling run included in the execution time ("profile-compile", "profile-interp"), then without profiling included, using a profile generated ahead-of-time on the same data the program runs over ("preprofiled-compile", "preprofiled-interp"). We include Carac invoked in a few different ways to represent different usages. First, as an executable JAR ("jar-"), which most closely mimics how Soufflé is invoked. Next, we use GraalVM to generate a Graal Native Image ("native-lambda"), which is run as a native executable. GraalVM does not support generating a Graal Native Image for backends that require runtime classloading, as the Bytecode and Quotes backends do. Lastly, we include the same experiments for Carac using a warmed-up JVM and warmed-up Scala compiler ("warm-"), which most closely mimics how we would expect Carac to be invoked.

The "warm" Carac experiments generally perform better than the JAR experiments due to the cost of executing on a cold JVM and Scala compiler. The speedup can be significant due to aggressive inlining (for example, on average 170x faster for the Lambda backend in the experiments shown in Figure 13), or minimal for modes that benefit less from JVM optimizations (for example, the Quotes backend is only 7x faster and the Bytecode backend 30x). The Graal Native Image outperforms the JAR for small programs (on average 22x), but for longer-running programs like Points-To and Inverse Functions, the Carac program execution time dominates the total runtime, so the Graal Native Image does not show any benefit over the JAR. It also does not show a benefit over the warm experiments.
Carac outperforms Soufflé when compiling, both with and without the profiling phase included, for the experiments in Figure 13, because Carac avoids the cost of invoking a full C++ compilation, which dominates Soufflé's execution time. For the longer-running programs like Inverse Functions, Carac at best outperforms "profile-compile" by 86x (Lambda, warmed-up) and at worst by 3x (Quotes, JAR). Without profiling included ("preprofiled-compile"), Carac executes at best 36x and at worst 1.4x faster. For smaller programs like Ackermann and CBA, Carac (Lambda, warmed-up) outperforms Soufflé's interpreter without profiling ("preprofiled-interp") by 14x and 32x respectively, and with profiling ("profile-interp") by 30x and 64x respectively.

Fig. 11. Ahead-of-time Only Compilation (small)

Fig. 12. Ahead-of-time Only Compilation

Fig. 13. Execution time (s) of Soufflé and Carac by execution engine and compilation mode (Soufflé: preprofiled-interp, profile-interp, preprofiled-compile, profile-compile; Carac: jar-bytecode, jar-quotes, jar-lambda, native-lambda, warm-quotes, warm-lambda, warm-bytecode) on the Ackermann, CBA, Points-To, and Inverse Functions benchmarks

Both engines specialize the input Datalog program into an imperative intermediate representation comprising relational algebra operators, control-flow statements, and relation-management operations. Soufflé supports an interpretation mode and a compilation mode, while Carac's JIT alternates, depending on runtime information, between interpretation of the intermediate representation and code generation using multi-stage programming.
For Points-To and Inverse Functions, Soufflé's "preprofiled-interp" outperforms Carac's best execution time (Lambda, warmed-up) by 3.3x and 3.6x. With the profiling phase included, Soufflé's interpreter outperforms Carac by 1.4x and 1.5x. The comparison is imperfect, however: the Soufflé variant of Datalog is more expressive and may spend additional cycles on language features that these programs do not use, which would help Carac outperform Soufflé in some experiments. Additionally, Soufflé implements many orthogonal optimizations that Carac does not currently support, for example auto-indexing, a specialized B-tree implementation, relation inlining, and others, which would help Soufflé outperform Carac in other experiments. Although the comparison is not perfect, it can be concluded from Figure 13 that for the longest-running of the presented programs, Soufflé's best auto-tuned execution time outperforms Carac's best execution time by at most 3.6x. Ultimately, the goal of Carac is not to present an alternative to Soufflé but to illustrate a potential runtime optimization, leveraging advanced programming language capabilities like quotes and splices, that would enable Soufflé to avoid a profiling phase entirely, which for large programs can be significant. Carac can tightly knit a popular general-purpose programming language, Scala, with declarative Datalog queries while still delivering execution times that are competitive with the most advanced Datalog systems.

Several Datalog engines do not implement any join-reordering optimizations [36], but RecStep [37] optimizes Datalog subquery execution at every iteration by using runtime statistics, including reading relation cardinalities and switching the build and probe sides of its hash-join implementation. Carac's approach differs in that it leverages code generation and reorders the entire n-way join once per iteration, per predicate, or per subquery, not per 2-way join.

b) Adaptive Query Processing.
Just-in-time code-generating database systems like HyPer, HIQUE, Hekaton, ViDa, and many others [8]–[10] use SQL as a declarative specification from which to generate programs at query compile-time. There are several JVM-based bytecode-generating DBMSs, for example Presto [38] and Spark [39]. Generally, once the code for a query is generated, it is not regenerated within the execution of that query by the generated program. Adaptive Query Processing (AQP), however, which uses runtime statistics to inform query evaluation decisions and avoid the well-known inaccuracy of cost-model-based estimation [6], has received significant attention from the database community [7]. Notably, NoisePage [11] and Umbra [12] combine AQP with code generation to modify generated query plans at runtime: NoisePage through manipulation of function-pointer arrays, and Umbra by embedding code fragments for all possible execution variants into the generated code and using LLVM's indirectbr instruction to jump to label addresses at runtime. Most code-generating AQP systems target general-purpose SQL, while the recursive nature of Datalog queries makes the primary considerations for query optimization different [31]; Soufflé uses runtime information to inform join optimization decisions but requires an offline profiling stage. Carac's goal of applying optimizations at the level of abstraction most appropriate for each optimization is shared by MLIR [40], an intermediate representation that allows co-optimizing multiple representation levels. Similarly to MLIR and verified lifting, Carac follows a multi-representation strategy and lifts input programs, but additionally incorporates runtime information to generate optimal execution strategies.

VII. RELATED WORK

Program optimization and runtime analysis have received attention from both the database and compiler communities.

a) Datalog.
Datalog has been used to solve fixed-point problems within programming languages, for example as the basis of Ascent, a Datalog-based logic programming language embedded in Rust for use in the borrow checker [33]. Datalog has also recently been used for Java pointer and taint analysis in Doop [1], TensorFlow [2], Ethereum Virtual Machine [3], and AWS network analysis [4]. Datalog is the logical foundation for Dedalus [34], a language designed to model distributed systems upon which the Bloom language is founded. Bloom uses Datalog as a basis to formally guarantee consistency properties of distributed systems. The HYDRO language stack [35] takes this approach a step further and argues that the next programming paradigm for cloud computing should involve using verified lifting to translate user-written programs into a logic-based internal representation language that has roots in Datalog. Flix [17] is a declarative programming language that integrates firstclass Datalog constraints into the language. Several Datalog engines do not implement any join-reordering optimizations VIII. C ONCLUSION In this work, we introduced Adaptive Metaprogramming, a novel approach that combines the power of codegenerating just-in-time data management systems with principled metaprogramming to optimize recursive queries at compile and runtime. Unlike existing methods, Adaptive Metaprogramming enables code generation and optimization to continue past the start of query execution, potentially deoptimizing when necessary, to better adapt to real-time data changes. Through Carac, we demonstrate that Adaptive Metaprogramming can enhance the performance of Datalog queries substantially, achieving a speed increase of three orders of magnitude over unoptimized queries and 4x over handtuned queries by optimizing the join order of conjunctive queries during Datalog execution. 12 R EFERENCES [19] L. D. 
K. Hoder, N. Bjørner, and L. de Moura, "µZ: an efficient engine for fixed points with constraints," in Computer Aided Verification (CAV 2011), ser. Lecture Notes in Computer Science, vol. 6806. Springer, 2011, pp. 457–462. [20] A. Herlihy, P. Chrysogelos, and A. Ailamaki, "Boosting efficiency of external pipelines by blurring application boundaries," in 12th Conference on Innovative Data Systems Research, CIDR 2022, Chaminade, CA, January 11–15, 2022, Online Proceedings. www.cidrdb.org, 2022. [21] V. Leis, B. Radke, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann, "Query optimization through the looking glass, and what we found running the join order benchmark," The VLDB Journal, vol. 27, no. 5, pp. 643–668, Oct. 2018. [Online]. Available: https://doi.org/10.1007/s00778-017-0480-7 [22] P. McIlroy, "Optimistic sorting and information theoretic complexity," in Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA '93. USA: Society for Industrial and Applied Mathematics, 1993, pp. 467–474. [23] N. A. Stucki, "Scalable metaprogramming in Scala 3," Ph.D. dissertation, EPFL, 2023. [Online]. Available: http://infoscience.epfl.ch/record/299370 [24] Y. Futamura, "Partial evaluation of computation process: an approach to a compiler-compiler," Higher-Order and Symbolic Computation, vol. 12, pp. 381–391, 1999. [25] A. Kennedy and C. V. Russo, "Generalized algebraic data types and object-oriented programming," SIGPLAN Not., vol. 40, no. 10, pp. 21–40, Oct. 2005. [Online]. Available: https://doi.org/10.1145/1103845.1094814 [26] OpenJDK, "JEP draft: Class-file API (preview)." [Online]. Available: https://openjdk.org/jeps/8280389 [27] A. Shipilev, S. Kuksenko, A. Astrand, S. Friberg, and H. Loef, "OpenJDK code tools: JMH," 2022. [Online]. Available: http://openjdk.java.net/projects/code-tools/jmh/ [28] Scala Center, "tasty-query." [Online]. Available: https://github.com/scalacenter/tasty-query [29] A.
Møller and M. I. Schwartzbach, "Static program analysis," Department of Computer Science, Aarhus University, October 2018. [Online]. Available: https://cs.au.dk/~amoeller/spa/spa.pdf [30] Y. Sundblad, "The Ackermann function: a theoretical, computational, and formula manipulative study," BIT, vol. 11, no. 1, pp. 107–119, Mar. 1971. [Online]. Available: https://doi.org/10.1007/BF01935330 [31] S. Arch, X. Hu, D. Zhao, P. Subotić, and B. Scholz, "Building a join optimizer for Soufflé," in Logic-Based Program Synthesis and Transformation, A. Villanueva, Ed. Cham: Springer International Publishing, 2022, pp. 83–102. [32] N. Stucki, A. Biboudis, and M. Odersky, "A practical unification of multi-stage programming and macros," SIGPLAN Not., vol. 53, no. 9, pp. 14–27, Nov. 2018. [Online]. Available: https://doi.org/10.1145/3393934.3278139 [33] A. Sahebolamri, T. Gilray, and K. Micinski, "Seamless deductive inference via macros," in Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, ser. CC 2022. New York, NY, USA: Association for Computing Machinery, 2022, pp. 77–88. [Online]. Available: https://doi.org/10.1145/3497776.3517779 [34] P. Alvaro, W. R. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. Sears, "Dedalus: Datalog in time and space," in Datalog Reloaded, O. de Moor, G. Gottlob, T. Furche, and A. Sellers, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 262–281. [35] A. Cheung, N. Crooks, J. M. Hellerstein, and M. Milano, "New directions in cloud programming," in 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11–15, 2021, Online Proceedings. www.cidrdb.org, 2021. [Online]. Available: http://cidrdb.org/cidr2021/papers/cidr2021_paper16.pdf [36] B. Ketsman and P. Koutris, "Modern Datalog engines," Foundations and Trends® in Databases, vol. 12, no. 1, pp. 1–68, 2022. [Online]. Available: http://dx.doi.org/10.1561/1900000073 [37] Z. Fan, J. Zhu, Z. Zhang, A. Albarghouthi, P. Koutris, and J. M.
Patel, "Scaling-up in-memory Datalog processing: observations and techniques," Proc. VLDB Endow., vol. 12, no. 6, pp. 695–708, Feb. 2019. [Online]. Available: https://doi.org/10.14778/3311880.3311886 [38] R. Sethi, M. Traverso, D. Sundstrom, D. Phillips, W. Xie, Y. Sun, N. Yegitbasi, H. Jin, E. Hwang, N. Shingte, and C. Berner, "Presto: SQL on everything," in 2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019, pp. 1802–1813. [1] Y. Smaragdakis and M. Bravenboer, "Using Datalog for fast and easy program analysis," in Proceedings of the First International Conference on Datalog Reloaded, ser. Datalog'10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 245–251. [Online]. Available: https://doi.org/10.1007/978-3-642-24206-9_14 [2] S. Lagouvardos, J. T. Dolby, N. Grech, A. Antoniadis, and Y. Smaragdakis, "Static analysis of shape in TensorFlow programs," in ECOOP, 2020. [3] N. Grech, L. Brent, B. Scholz, and Y. Smaragdakis, "Gigahorse: thorough, declarative decompilation of smart contracts," in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, pp. 1176–1186. [4] J. Backes, S. Bayless, B. Cook, C. Dodge, A. Gacek, A. Hu, T. Kahsai, B. Kocik, E. Kotelnikov, J. Kukovec, S. McLaughlin, J. Reed, N. Rungta, J. Sizemore, M. Stalzer, P. Srinivasan, P. Subotic, C. Varming, and B. Whaley, Reachability Analysis for AWS-Based Networks. Springer International Publishing, Jul. 2019, pp. 231–241. [5] J. D. Ullman, "Bottom-up beats top-down for Datalog," in Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ser. PODS '89. New York, NY, USA: ACM, 1989, pp. 140–149. [Online]. Available: https://doi.org/10.1145/73721.73736 [6] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann, "How good are query optimizers, really?" Proc. VLDB Endow., vol. 9, no. 3, pp. 204–215, Nov. 2015. [Online]. Available: https://doi.org/10.14778/2850583.2850594 [7] A. Deshpande, J. M. Hellerstein, and V.
Raman, "Adaptive query processing: why, how, when, what next," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27–29, 2006, pp. 806–807. [Online]. Available: https://doi.org/10.1145/1142473.1142603 [8] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz, "Everything you always wanted to know about compiled and vectorized queries but were afraid to ask," Proc. VLDB Endow., vol. 11, no. 13, pp. 2209–2222, Sep. 2018. [Online]. Available: https://doi.org/10.14778/3275366.3284966 [9] A. Kohn, V. Leis, and T. Neumann, "Adaptive execution of compiled queries," in 2018 IEEE 34th International Conference on Data Engineering (ICDE), Apr. 2018, pp. 197–208. [10] M. Karpathiotakis, I. Alagiannis, T. Heinis, M. Branco, and A. Ailamaki, "Just-in-time data virtualization: lightweight data management with ViDa," in 7th Biennial Conference on Innovative Data Systems Research, CIDR 2015, Jan. 2015. [11] P. Menon, A. Ngom, L. Ma, T. C. Mowry, and A. Pavlo, "Permutable compiled queries: dynamically adapting compiled queries without recompiling," Proc. VLDB Endow., vol. 14, no. 2, pp. 101–113, Nov. 2020. [Online]. Available: https://doi.org/10.14778/3425879.3425882 [12] T. Schmidt, P. Fent, and T. Neumann, "Efficiently compiling dynamic code for adaptive query processing," in 13th Workshop on Accelerating Analytics and Data Management, Sep. 2022. [13] W. Taha, A Gentle Introduction to Multi-Stage Programming, Part II. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 260–290. [Online]. Available: https://doi.org/10.1007/978-3-540-88643-3_6 [14] S. Ceri, G. Gottlob, and L. Tanca, "What you always wanted to know about Datalog (and never dared to ask)," IEEE Transactions on Knowledge and Data Engineering, vol. 1, no. 1, pp. 146–166, Jan. 1989. [15] J. D. Ullman, "Bottom-up beats top-down for Datalog," in ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1989. [16] B. Scholz, H. Jordan, P. Subotić, and T.
Westmann, "On fast large-scale program analysis in Datalog," in Proceedings of the 25th International Conference on Compiler Construction, ser. CC 2016. New York, NY, USA: ACM, 2016, pp. 196–206. [Online]. Available: https://doi.org/10.1145/2892208.2892226 [17] M. Madsen, M.-H. Yee, and O. Lhoták, "From Datalog to Flix: a declarative language for fixed points on lattices," SIGPLAN Not., vol. 51, no. 6, pp. 194–208, Jun. 2016. [Online]. Available: https://doi.org/10.1145/2980983.2908096 [18] J. Whaley, D. Avots, M. Carbin, and M. S. Lam, "Using Datalog with binary decision diagrams for program analysis," in Proceedings of the Third Asian Conference on Programming Languages and Systems, ser. APLAS'05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 97–118. [Online]. Available: https://doi.org/10.1007/11575467_8 [39] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI'12. USA: USENIX Association, 2012, p. 2. [40] C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. A. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko, "MLIR: scaling compiler infrastructure for domain specific computation," in IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2021, Seoul, South Korea, February 27 – March 3, 2021, J. W. Lee, M. L. Soffa, and A. Zaks, Eds. IEEE, 2021, pp. 2–14. [Online]. Available: https://doi.org/10.1109/CGO51591.2021.9370308