Optimizing prediction subroutine
================================

Treelite offers system-level optimizations to boost prediction performance.
Notice that the model information is wholly preserved; the optimizations only
affect the manner in which prediction is performed.

.. contents:: Contents
   :local:
   :backlinks: none
   :depth: 2

Annotate conditional branches
-----------------------------

This optimization analyzes and annotates every threshold condition in the test
nodes to improve prediction performance.

How to use
~~~~~~~~~~

The first step is to generate the **branch annotation record** for your
ensemble model. Make sure to have your training data ready.

.. code-block:: python

   # model = your ensemble model (object of type treelite.Model)
   # dmat = training data (object of type treelite.DMatrix)

   # Create annotator object
   annotator = treelite.Annotator()
   # Annotate branches by iterating over the training data
   annotator.annotate_branch(model=model, dmat=dmat, verbose=True)
   # Save the branch annotation record as a JSON file
   annotator.save(path='mymodel-annotation.json')

To utilize the branch annotation record, supply the compiler parameter
``annotate_in`` when exporting the model:

.. code-block:: python
   :emphasize-lines: 3, 8, 12

   # Export a source directory
   model.compile(dirpath='./mymodel', verbose=True,
                 params={'annotate_in': 'mymodel-annotation.json'})

   # Export a source directory, packaged in a zip archive
   model.export_srcpkg(platform='unix', toolchain='gcc',
                       pkgpath='./mymodel.zip', libname='mymodel.so', verbose=True,
                       params={'annotate_in': 'mymodel-annotation.json'})

   # Export a shared library
   model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True,
                    params={'annotate_in': 'mymodel-annotation.json'})

Technical details
~~~~~~~~~~~~~~~~~

Rationale
^^^^^^^^^

Modern CPUs rely heavily on a technique known as branch prediction, in which
they "guess" the result of the conditional expression in each ``if``-``else``
branch ahead of time. Given a program

.. code-block:: c

   if ( [conditional expression] ) {
     foo();
   } else {
     bar();
   }

the CPU will pre-fetch the instructions for the function ``foo()`` if the
given condition is likely to be true. On the other hand, if the condition is
likely to be false, the CPU will pre-fetch the instructions for the function
``bar()``.

It suffices to say that correctly predicting conditional branches has a great
impact on performance. Each time the CPU predicts a branch correctly, it can
keep the instructions it pre-fetched earlier. Each time it mispredicts, it
must discard the pre-fetched instructions and fetch another set. If you'd like
to learn more about the importance of branch prediction, there is an excellent
introductory article from Stanford on the subject.
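To make the cost of branch misprediction concrete, consider the small
standalone C sketch below (this is not Treelite-generated code; the function
name is made up for illustration). The loop performs the same amount of work
regardless of how ``data`` is ordered, yet it typically runs noticeably faster
when ``data`` is sorted, because the branch then settles into long, easily
predicted runs of taken and not-taken outcomes.

.. code-block:: c

   #include <stddef.h>

   /* Sum all elements that are >= 128. The branch below is easy to predict
      when `data` is sorted (long runs of identical outcomes) and hard to
      predict when `data` is in random order, even though the total amount
      of work is identical in both cases. */
   long sum_large(const int* data, size_t n) {
     long sum = 0;
     for (size_t i = 0; i < n; ++i) {
       if (data[i] >= 128) {
         sum += data[i];
       }
     }
     return sum;
   }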
The prediction subroutine for a decision tree ensemble is problematic, as it
is replete with conditional branches that will have to be guessed well:

.. code-block:: c

   /* A slice of prediction subroutine */
   float predict_margin(const Entry* data) {
     float sum = 0.0f;
     if (!(data[0].missing != -1) || data[0].fvalue <= 9.5) {
       if (!(data[0].missing != -1) || data[0].fvalue <= 3.5) {
         if (!(data[10].missing != -1) || data[10].fvalue <= 0.74185) {
           if (!(data[0].missing != -1) || data[0].fvalue <= 1.5) {
             if (!(data[2].missing != -1) || data[2].fvalue <= 2.08671) {
               if ( (data[4].missing != -1) && data[4].fvalue <= 2.02632) {
                 if (!(data[3].missing != -1) || data[3].fvalue <= 0.763339) {
                   sum += (float)0.00758165;
                 } else {
                   sum += (float)0.0060202;
                 }
               } else {
                 if ( (data[1].missing != -1) && data[1].fvalue <= 0.0397456) {
                   sum += (float)0.00415399;
                 } else {
                   sum += (float)0.00821985;
                 }
               }
               /* and so forth... */

In fact, every threshold condition in the test nodes will need to be
predicted. While the CPU lacks adequate information to make good guesses about
these conditions, we can help it by supplying that information.

Mechanism for supplying the C compiler with branch information
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We predict the likelihood of each condition by counting the number of data
points from the training data that satisfy that condition. See the diagram
below for an illustration.

.. plot::
   :nofigs:

   from graphviz import Source
   source = r"""
     digraph branch_annotation {
       graph [fontname = "helvetica"];
       node [fontname = "helvetica"];
       edge [fontname = "helvetica"];
       0 [label="Feature 4224 < 2.0 ?", shape=box];
       1 [label="Leaf: -0.0198"];
       2 [label="Feature 13 < 0.848 ?", shape=box];
       5 [label="Feature 12 < 0.878 ?", shape=box];
       6 [label="Feature 5 < -0.384 ?", shape=box];
       11 [label=" ⋮", margin=-0.5, shape=none, fontsize=20];
       12 [label="Feature 8 < -0.231?", shape=box];
       13 [label="Leaf: -0.0194"];
       14 [label="Leaf: -0.0196"];
       25 [label="Leaf: -0.0194"];
       26 [label="Feature 1693 < 2 ?", shape=box];
       53 [label="Leaf: 0"];
       54 [label="Leaf: -0.0196"];
       0 -> 1 [labeldistance=2.0, labelangle=45, headlabel=< Yes: 3,092,211 >];
       0 -> 2 [labeldistance=2.0, labelangle=-45, headlabel=< No/Missing: 3,902,211>];
       2 -> 5 [labeldistance=2.0, labelangle=45, headlabel=< Yes: 3,342,535 >];
       2 -> 6 [labeldistance=2.0, labelangle=-45, headlabel=< No/Missing: 583,254>];
       5 -> 11 [labeldistance=2.0, labelangle=45, headlabel=< Yes: 2,878,952 >];
       5 -> 12 [labeldistance=2.0, labelangle=-45, headlabel=< No/Missing: 445,583>];
       6 -> 13 [labeldistance=2.0, labelangle=45, headlabel=< Yes: 266,188 >];
       6 -> 14 [labeldistance=2.0, labelangle=-45, headlabel=< No/Missing: 317,066>];
       12 -> 25 [labeldistance=2.0, labelangle=45, headlabel=< Yes: 257,828 >];
       12 -> 26 [labeldistance=2.0, labelangle=-45, headlabel=< No/Missing: 187,755>];
       26 -> 53 [labeldistance=2.0, labelangle=45, headlabel=< Yes: 4 >];
       26 -> 54 [labeldistance=2.0, labelangle=-45, headlabel=< No/Missing: 187,751>];
     }
   """
   Source(source, format='png').render('../_static/branch_annotation', view=False)
   Source(source, format='svg').render('../_static/branch_annotation', view=False)

.. raw:: html

   <img src="../_static/branch_annotation.svg" width="100%">
If a condition is true at least 50% of the time (over the training data), the
condition is labeled as "expected to be true":

.. code-block:: c

   /* expected to be true */
   if ( __builtin_expect( [condition], 1 ) ) {
     ...
   } else {
     ...
   }

On the other hand, if a condition is false at least 50% of the time, the
condition is labeled as "expected to be false":

.. code-block:: c

   /* expected to be false */
   if ( __builtin_expect( [condition], 0 ) ) {
     ...
   } else {
     ...
   }

.. note:: On the expression ``__builtin_expect``

   The ``__builtin_expect`` expression is a compiler intrinsic for supplying
   the C compiler with branch prediction information. Both gcc and clang
   support it; unfortunately, Microsoft Visual C++ does not. To take advantage
   of branch annotation, make sure to use gcc or clang on the target machine.

Use integer thresholds for conditions
-------------------------------------

This optimization replaces all thresholds in the test nodes with integers so
that each threshold condition performs an integer comparison instead of the
usual floating-point comparison. The thresholds are said to be **quantized**
into integer indices.

BEFORE:

.. code-block:: c

   if (data[3].fvalue < 1.5) {  /* floating-point comparison */
     ...
   }

AFTER:

.. code-block:: c

   if (data[3].qvalue < 3) {  /* integer comparison */
     ...
   }

How to use
~~~~~~~~~~

Simply add the compiler parameter ``quantize=1`` when exporting the model:

.. code-block:: python
   :emphasize-lines: 3, 8, 12

   # Export a source directory
   model.compile(dirpath='./mymodel', verbose=True,
                 params={'quantize': 1})

   # Export a source directory, packaged in a zip archive
   model.export_srcpkg(platform='unix', toolchain='gcc',
                       pkgpath='./mymodel.zip', libname='mymodel.so', verbose=True,
                       params={'quantize': 1})

   # Export a shared library
   model.export_lib(toolchain='gcc', libpath='./mymodel.so', verbose=True,
                    params={'quantize': 1})

Technical details
~~~~~~~~~~~~~~~~~

Rationale
^^^^^^^^^

On some platforms such as x86-64, replacing floating-point thresholds with
integers helps improve performance by 1) **reducing executable code size** and
2) **improving data locality**. This is because, on these platforms, integer
constants can be embedded as part of a comparison instruction, whereas
floating-point constants cannot.

Let's look at the x86-64 platform. The integer comparison

.. code-block:: c

   a <= 4

produces one assembly instruction:

.. code-block:: nasm

   cmpl   $4, 8(%rsp)         ; 8(%rsp) contains the variable a

Since the integer constant ``4`` got embedded into the comparison instruction
``cmpl``, we only had to fetch the variable ``a`` from memory.

On the other hand, the floating-point comparison

.. code-block:: c

   b < 1.2f

produces two assembly instructions:

.. code-block:: nasm

   movss    250(%rip), %xmm0  ; 250(%rip) contains the constant 1.2f
   ucomiss  12(%rsp), %xmm0   ; 12(%rsp) contains the variable b

Notice that the floating-point constant ``1.2f`` did not get embedded into the
comparison instruction ``ucomiss``. The constant had to be fetched (with
``movss``) into the register ``xmm0`` before the comparison could take place.

To summarize,

* a floating-point comparison takes twice as many instructions as an integer
  comparison, increasing the executable code size;
* a floating-point comparison involves an extra fetch instruction (``movss``),
  potentially causing a cache miss.
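If you'd like to verify the code-size claim on your own machine, you can
compile a minimal pair of comparison functions and inspect the generated
assembly. This is a standalone sketch; the file and function names are made up
for illustration:

.. code-block:: c

   /* thresholds.c -- compile with `gcc -O2 -S thresholds.c` and inspect thresholds.s */

   /* Integer threshold: the constant 4 can be embedded as an immediate
      operand of the comparison instruction. */
   int int_threshold(int a) {
     return a <= 4;
   }

   /* Floating-point threshold: the constant 1.2f must first be loaded from
      a constant pool in memory before the comparison can take place. */
   int float_threshold(float b) {
     return b < 1.2f;
   }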
**Caveats.** As we'll see in the next section, using integer thresholds adds
overhead costs at prediction time. You should make sure that the benefits of
integer comparisons outweigh those costs.

Mechanism for mapping features
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When the ``quantize`` option is enabled, Treelite collects all thresholds
occurring in the tree ensemble model. For each feature, it generates a list of
that feature's thresholds in ascending order:

.. code-block:: none

   /* example of how per-feature threshold lists may look */
   Feature 0:  [1.5, 6.5, 12.5]
   Feature 3:  [0.15, 0.35, 1.5]
   Feature 6:  [7, 9, 10, 135]

Using these lists, we may convert any data point into integer indices via
simple look-ups. For feature 0 in the example above, values are mapped to
integer indices as follows:

.. code-block:: none

   Let x be the value of feature 0.

   Assign -1 if          x  <  1.5
   Assign  0 if          x ==  1.5
   Assign  1 if   1.5  < x  <  6.5
   Assign  2 if          x ==  6.5
   Assign  3 if   6.5  < x  < 12.5
   Assign  4 if          x == 12.5
   Assign  5 if          x  > 12.5

Let's look at a specific example of how a floating-point feature vector gets
translated into a vector of integer indices:

.. code-block:: none

   feature id    0        1        2       3        4        5       6
             [   7, missing, missing,    0.2, missing, missing,     20 ]
   =>        [   3, missing, missing,      1, missing, missing,      5 ]

Since the prediction subroutine still needs to accept floating-point features,
the feature values are converted internally before the actual prediction takes
place. If the prediction subroutine looked like the following without the
``quantize`` option,

.. code-block:: c

   float predict_margin(const Entry* data) {
     /* ... Run through the trees to compute the leaf output score ... */
     return score;
   }

it will now have an extra step that maps the incoming data vector into integer
indices:

.. code-block:: c
   :emphasize-lines: 2

   float predict_margin(const Entry* data) {
     /* ... Quantize feature values in data into integer indices ... */
     /* ... Run through the trees to compute the leaf output score ... */
     return score;
   }
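For a concrete picture of what this extra quantization step involves, here is
a minimal sketch of one way the per-feature look-up could be implemented: a
binary search over the sorted threshold list, with even indices reserved for
exact matches and odd indices for values that fall strictly between two
thresholds. The function below is illustrative only and is not Treelite's
actual internal implementation.

.. code-block:: c

   #include <stddef.h>

   /* Map a raw feature value to an integer index, given the feature's sorted
      threshold list. With n thresholds the possible indices are
      -1, 0, 1, ..., 2*n - 1:
        index 2*i     -> exactly equal to thresholds[i]
        index 2*i - 1 -> strictly between thresholds[i-1] and thresholds[i]
        index -1      -> smaller than every threshold
        index 2*n - 1 -> larger than every threshold                        */
   int quantize(float x, const float* thresholds, size_t n) {
     size_t lo = 0, hi = n;
     while (lo < hi) {               /* binary search: first threshold >= x */
       size_t mid = lo + (hi - lo) / 2;
       if (thresholds[mid] < x) {
         lo = mid + 1;
       } else {
         hi = mid;
       }
     }
     if (lo == n) {
       return 2 * (int)n - 1;        /* larger than every threshold */
     }
     if (thresholds[lo] == x) {
       return 2 * (int)lo;           /* exact match with thresholds[lo] */
     }
     return 2 * (int)lo - 1;         /* strictly between (or below all) */
   }

With this particular encoding, a test such as ``x < thresholds[i]`` can be
answered by the integer comparison ``q < 2*i`` on the quantized value ``q``;
the encoding Treelite actually emits may differ in detail.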