.. _chimera: ####### Chimera ####### This section describes Chimera library. ************ Introduction ************ Chimera is a software regular expression matching engine that is a hybrid of Hyperscan and PCRE. The design goals of Chimera are to fully support PCRE syntax as well as to take advantage of the high performance nature of Hyperscan. Chimera inherits the design guideline of Hyperscan with C APIs for compilation and scanning. The Chimera API itself is composed of two major components: =========== Compilation =========== These functions take a group of regular expressions, along with identifiers and option flags, and compile them into an immutable database that can be used by the Chimera scanning API. This compilation process performs considerable analysis and optimization work in order to build a database that will match the given expressions efficiently. See :ref:`chcompile` for more details ======== Scanning ======== Once a Chimera database has been created, it can be used to scan data in memory. Chimera only supports block mode in which we scan a single contiguous block in memory. Matches are delivered to the application via a user-supplied callback function that is called synchronously for each match. For a given database, Chimera provides several guarantees: * No memory allocations occur at runtime with the exception of scratch space allocation, it should be done ahead of time for performance-critical applications: - **Scratch space**: temporary memory used for internal data at scan time. Structures in scratch space do not persist beyond the end of a single scan call. * The size of the scratch space required for a given database is fixed and determined at database compile time. This means that the memory requirement of the application are known ahead of time, and the scratch space can be pre-allocated if required for performance reasons. * Any pattern that has successfully been compiled by the Chimera compiler can be scanned against any input. There could be internal resource limits or other limitations caused by PCRE at runtime that could cause a scan call to return an error. .. note:: Chimera is designed to have the same matching behavior as PCRE, including greedy/ungreedy, capturing, etc. Chimera reports both **start offset** and **end offset** for each match like PCRE. Different from the fashion of reporting all matches in Hyperscan, Chimera only reports non-overlapping matches. For example, the pattern :regexp:`/foofoo/` will match ``foofoofoofoo`` at offsets (0, 6) and (6, 12). .. note:: Since Chimera is a hybrid of Hyperscan and PCRE in order to support full PCRE syntax, there will be extra performance overhead compared to Hyperscan-only solution. Please always use Hyperscan for better performance unless you must need full PCRE syntax support. See :ref:`chruntime` for more details ************ Requirements ************ The PCRE library (http://pcre.org/) version 8.41 is required for Chimera. .. note:: Since Chimera needs to reference PCRE internal function, please place PCRE source directory under Hyperscan root directory in order to build Chimera. Beside this, both hardware and software requirements of Chimera are the same to Hyperscan. See :ref:`hardware` and :ref:`software` for more details. .. note:: Building Hyperscan will automatically generate Chimera library. Currently only static library is supported for Chimera, so please use static build type when configure CMake build options. .. _chcompile: ****************** Compiling Patterns ****************** =================== Building a Database =================== The Chimera compiler API accepts regular expressions and converts them into a compiled pattern database that can then be used to scan data. The API provides two functions that compile regular expressions into databases: #. :c:func:`ch_compile`: compiles a single expression into a pattern database. #. :c:func:`ch_compile_multi`: compiles an array of expressions into a pattern database. All of the supplied patterns will be scanned for concurrently at scan time, with user-supplied identifiers returned when they match. #. :c:func:`ch_compile_ext_multi`: compiles an array of expressions as above, but allows PCRE match limits to be specified for each expression. Compilation allows the Chimera library to analyze the given pattern(s) and pre-determine how to scan for these patterns in an optimized fashion using Hyperscan and PCRE. =============== Pattern Support =============== Chimera fully supports the pattern syntax used by the PCRE library ("libpcre"), described at .The version of PCRE used to validate Chimera's interpretation of this syntax is 8.41. ========= Semantics ========= Chimera supports the exact same semantics of PCRE library. Moreover, it supports multiple simultaneous pattern matching like Hyperscan and the multiple matches will be reported in order by end offset. .. _chruntime: ********************* Scanning for Patterns ********************* Chimera provides scan function with ``ch_scan``. ================ Handling Matches ================ ``ch_scan`` will call a user-supplied callback function when a match is found. This function has the following signature: .. doxygentypedef:: ch_match_event_handler :outline: :no-link: The *id* argument will be set to the identifier for the matching expression provided at compile time, and the *from* argument will be set to the start-offset of the match the *to* argument will be set to the end-offset of the match. The *captured* stores offsets of entire pattern match as well as captured subexpressions. The *size* will be set to the number of valid entries in the *captured*. The match callback function has the capability to continue or halt scanning by returning different values. See :c:type:`ch_match_event_handler` for more information. ======================= Handling Runtime Errors ======================= ``ch_scan`` will call a user-supplied callback function when a runtime error occurs in libpcre. This function has the following signature: .. doxygentypedef:: ch_error_event_handler :outline: :no-link: The *id* argument will be set to the identifier for the matching expression provided at compile time. The match callback function has the capability to either halt scanning or continue scanning for the next pattern. See :c:type:`ch_error_event_handler` for more information. ============= Scratch Space ============= While scanning data, Chimera needs a small amount of temporary memory to store on-the-fly internal data. This amount is unfortunately too large to fit on the stack, particularly for embedded applications, and allocating memory dynamically is too expensive, so a pre-allocated "scratch" space must be provided to the scanning functions. The function :c:func:`ch_alloc_scratch` allocates a large enough region of scratch space to support a given database. If the application uses multiple databases, only a single scratch region is necessary: in this case, calling :c:func:`ch_alloc_scratch` on each database (with the same ``scratch`` pointer) will ensure that the scratch space is large enough to support scanning against any of the given databases. While the Chimera library is re-entrant, the use of scratch spaces is not. For example, if by design it is deemed necessary to run recursive or nested scanning (say, from the match callback function), then an additional scratch space is required for that context. In the absence of recursive scanning, only one such space is required per thread and can (and indeed should) be allocated before data scanning is to commence. In a scenario where a set of expressions are compiled by a single "main" thread and data will be scanned by multiple "worker" threads, the convenience function :c:func:`ch_clone_scratch` allows multiple copies of an existing scratch space to be made for each thread (rather than forcing the caller to pass all the compiled databases through :c:func:`ch_alloc_scratch` multiple times). For example: .. code-block:: c ch_error_t err; ch_scratch_t *scratch_prototype = NULL; err = ch_alloc_scratch(db, &scratch_prototype); if (err != CH_SUCCESS) { printf("ch_alloc_scratch failed!"); exit(1); } ch_scratch_t *scratch_thread1 = NULL; ch_scratch_t *scratch_thread2 = NULL; err = ch_clone_scratch(scratch_prototype, &scratch_thread1); if (err != CH_SUCCESS) { printf("ch_clone_scratch failed!"); exit(1); } err = ch_clone_scratch(scratch_prototype, &scratch_thread2); if (err != CH_SUCCESS) { printf("ch_clone_scratch failed!"); exit(1); } ch_free_scratch(scratch_prototype); /* Now two threads can both scan against database db, each with its own scratch space. */ ================= Custom Allocators ================= By default, structures used by Chimera at runtime (scratch space, etc) are allocated with the default system allocators, usually ``malloc()`` and ``free()``. The Chimera API provides a facility for changing this behaviour to support applications that use custom memory allocators. These functions are: - :c:func:`ch_set_database_allocator`, which sets the allocate and free functions used for compiled pattern databases. - :c:func:`ch_set_scratch_allocator`, which sets the allocate and free functions used for scratch space. - :c:func:`ch_set_misc_allocator`, which sets the allocate and free functions used for miscellaneous data, such as compile error structures and informational strings. The :c:func:`ch_set_allocator` function can be used to set all of the custom allocators to the same allocate/free pair. ************************ API Reference: Constants ************************ =========== Error Codes =========== .. doxygengroup:: CH_ERROR :content-only: :no-link: ============= Pattern flags ============= .. doxygengroup:: CH_PATTERN_FLAG :content-only: :no-link: ================== Compile mode flags ================== .. doxygengroup:: CH_MODE_FLAG :content-only: :no-link: ******************** API Reference: Files ******************** ========== File: ch.h ========== .. doxygenfile:: ch.h ================= File: ch_common.h ================= .. doxygenfile:: ch_common.h ================== File: ch_compile.h ================== .. doxygenfile:: ch_compile.h ================== File: ch_runtime.h ================== .. doxygenfile:: ch_runtime.h