.. include::

.. _compilation:

##################
Compiling Patterns
##################

*******************
Building a Database
*******************

The Hyperscan compiler API accepts regular expressions and converts them into a
compiled pattern database that can then be used to scan data.

The API provides three functions that compile regular expressions into
databases:

#. :c:func:`hs_compile`: compiles a single expression into a pattern database.

#. :c:func:`hs_compile_multi`: compiles an array of expressions into a pattern
   database. All of the supplied patterns will be scanned for concurrently at
   scan time, with user-supplied identifiers returned when they match.

#. :c:func:`hs_compile_ext_multi`: compiles an array of expressions as above,
   but allows :ref:`extparam` to be specified for each expression.

Compilation allows the Hyperscan library to analyze the given pattern(s) and
pre-determine how to scan for these patterns in an optimized fashion that would
be far too expensive to compute at run-time.

When compiling expressions, a decision needs to be made whether the resulting
compiled patterns are to be used in a streaming, block or vectored mode:

- **Streaming mode**: the target data to be scanned is a continuous stream, not
  all of which is available at once; blocks of data are scanned in sequence and
  matches may span multiple blocks in a stream. In streaming mode, each stream
  requires a block of memory to store its state between scan calls.

- **Block mode**: the target data is a discrete, contiguous block which can be
  scanned in one call and does not require state to be retained.

- **Vectored mode**: the target data consists of a list of non-contiguous
  blocks that are available all at once. As for block mode, no retention of
  state is required.

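The difference between block and streaming mode can be illustrated without
Hyperscan itself. The following is a minimal, self-contained C sketch of a toy
single-literal matcher. It is not Hyperscan's implementation, and the names
``toy_stream_t``, ``toy_scan`` and ``TOY_MAX_LIT`` are hypothetical, invented
only for this illustration. It shows why a match that spans two scan calls can
only be found when some state is kept between those calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit on the literal length, chosen only for this sketch. */
#define TOY_MAX_LIT 64

/* Toy stand-in for per-stream state (NOT a Hyperscan type): the number of
 * leading bytes of the literal matched at the end of the previous chunk. */
typedef struct {
    size_t matched;
} toy_stream_t;

/* Build the classic prefix (failure) table for the literal. */
static void toy_build_fail(const char *lit, size_t n, size_t *fail) {
    fail[0] = 0;
    for (size_t i = 1, k = 0; i < n; i++) {
        while (k > 0 && lit[i] != lit[k]) {
            k = fail[k - 1];
        }
        if (lit[i] == lit[k]) {
            k++;
        }
        fail[i] = k;
    }
}

/* Scan one chunk for `lit`, resuming from the progress saved in `st`.
 * Returns the number of matches that END inside this chunk and stores the
 * new progress back into `st` for the next call. */
static unsigned toy_scan(toy_stream_t *st, const char *lit,
                         const char *data, size_t len) {
    size_t n = strlen(lit);
    size_t fail[TOY_MAX_LIT];
    unsigned matches = 0;

    assert(n > 0 && n <= TOY_MAX_LIT);
    assert(st->matched < n);
    toy_build_fail(lit, n, fail);

    size_t k = st->matched;
    for (size_t i = 0; i < len; i++) {
        while (k > 0 && data[i] != lit[k]) {
            k = fail[k - 1];       /* fall back to a shorter partial match */
        }
        if (data[i] == lit[k]) {
            k++;
        }
        if (k == n) {              /* full literal seen */
            matches++;
            k = fail[k - 1];
        }
    }
    st->matched = k;               /* state retained between scan calls */
    return matches;
}
```

With a single ``toy_stream_t`` kept across calls, scanning ``"xxfoo"`` and then
``"barxx"`` for the literal ``foobar`` reports one match on the second call.
Resetting ``matched`` to zero before each call, which is effectively what
scanning each chunk as an independent block does, reports no match at all.
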
To compile patterns to be used in streaming mode, the ``mode`` parameter of
:c:func:`hs_compile` must be set to :c:member:`HS_MODE_STREAM`; similarly,
block mode requires the use of :c:member:`HS_MODE_BLOCK` and vectored mode
requires the use of :c:member:`HS_MODE_VECTORED`. A pattern database compiled
for one mode (streaming, block or vectored) can only be used in that mode. The
version of Hyperscan used to produce a compiled pattern database must match the
version of Hyperscan used to scan with it.

Hyperscan provides support for targeting a database at a particular CPU
platform; see :ref:`instr_specialization` for details.

=====================
Compile Pure Literals
=====================

A pure literal is a special case of a regular expression. A character sequence
is regarded as a pure literal if and only if each character is read and
interpreted independently, with no syntactic association between adjacent
characters.

For example, consider the expression :regexp:`/bc?/`. Read as a regular
expression, it means the character ``b`` followed by either nothing or one
character ``c``. Read as a pure literal, it is a character sequence 3 bytes
long, containing the characters ``b``, ``c`` and ``?``. In the regular
expression case, the question mark ``?`` has a particular syntactic role, the
0-1 quantifier, which associates it with the character preceding it. Other
characters with special roles in regular grammar include ``[``, ``]``, ``(``,
``)``, ``{``, ``}``, ``-``, ``*``, ``+``, ``\``, ``|``, ``/``, ``:``, ``^``,
``.``, ``$``. In the pure literal case, all of these meta characters lose their
extra meaning and are treated as ordinary ASCII characters.

Hyperscan was initially designed to process common regular expressions. It
therefore embeds a complex parser that performs comprehensive regular grammar
interpretation.
In particular, identifying the meta characters above is the basic step in
interpreting far more complex regular grammars.

However, in real use cases patterns are not always regular expressions; they
may simply be pure literals. Problems arise if those pure literals contain
regular meta characters. If they are fed directly into the traditional
Hyperscan compile API, all of these meta characters are interpreted in their
predefined ways, which is unnecessary and produces results the user does not
expect. To avoid this misinterpretation by the traditional API, users have to
preprocess these literal patterns by converting the meta characters into
another format: either by adding a backslash ``\`` before certain meta
characters, or by converting all the characters into a hexadecimal
representation.

In ``v5.2.0``, Hyperscan introduces 2 new compile APIs for pure literal
patterns:

#. :c:func:`hs_compile_lit`: compiles a single pure literal into a pattern
   database.

#. :c:func:`hs_compile_lit_multi`: compiles an array of pure literals into a
   pattern database. All of the supplied patterns will be scanned for
   concurrently at scan time, with user-supplied identifiers returned when they
   match.

These 2 APIs are designed for use cases where all patterns contained in the
target rule set are pure literals. Users can pass the original pure literal
content directly into these APIs without worrying about regular meta characters
in their patterns. No preprocessing work is needed any more.

For the new APIs, the ``length`` of each literal pattern is a newly added
parameter. Hyperscan locates the end of the input expression from each
literal's explicit length, not by looking for a terminating ``\0`` character.

Supported flags: :c:member:`HS_FLAG_CASELESS`, :c:member:`HS_FLAG_SINGLEMATCH`,
:c:member:`HS_FLAG_SOM_LEFTMOST`.

.. note:: The literal compilation API does not support :ref:`extparam`. At
          run time, the traditional runtime APIs can still be used to match
          pure literal patterns.

.. note:: If the target rule set contains at least one regular expression,
          please use the traditional compile APIs :c:func:`hs_compile`,
          :c:func:`hs_compile_multi` and :c:func:`hs_compile_ext_multi`.
          The new literal APIs introduced here are designed for rule sets
          containing only pure literal expressions.

***************
Pattern Support
***************

Hyperscan supports the pattern syntax used by the PCRE library ("libpcre"), as
described in the PCRE documentation. However, not all constructs available in
libpcre are supported. The use of unsupported constructs will result in
compilation errors.

The version of PCRE used to validate Hyperscan's interpretation of this syntax
is 8.41 or above.

====================
Supported Constructs
====================

The following regex constructs are supported by Hyperscan:

* Literal characters and strings, with all libpcre quoting and character
  escapes.

* Character classes such as :regexp:`.` (dot), :regexp:`[abc]`, and
  :regexp:`[^abc]`, as well as the predefined character classes :regexp:`\\s`,
  :regexp:`\\d`, :regexp:`\\w`, :regexp:`\\v`, and :regexp:`\\h` and their
  negated counterparts (:regexp:`\\S`, :regexp:`\\D`, :regexp:`\\W`,
  :regexp:`\\V`, and :regexp:`\\H`).

* The POSIX named character classes :regexp:`[[:xxx:]]` and negated named
  character classes :regexp:`[[:^xxx:]]`.

* Unicode character properties, such as :regexp:`\\p{L}`, :regexp:`\\P{Sc}`,
  :regexp:`\\p{Greek}`.

* Quantifiers:

  * Quantifiers such as :regexp:`?`, :regexp:`*` and :regexp:`+` are supported
    when applied to arbitrary supported sub-expressions.

  * Bounded repeat quantifiers such as :regexp:`{n}`, :regexp:`{m,n}`,
    :regexp:`{n,}` are supported with limitations.

    * For arbitrary repeated sub-patterns: *n* and *m* should be either small
      or infinite, e.g. :regexp:`(a|b){4}`, :regexp:`(ab?c?d){4,10}` or
      :regexp:`(ab(cd)*){6,}`.

    * For single-character width sub-patterns such as :regexp:`[^\\a]` or
      :regexp:`.` or :regexp:`x`, nearly all repeat counts are supported, except
      where repeats are extremely large (maximum bound greater than 32767).
      Stream states may be very large for large bounded repeats, e.g.
      :regexp:`a.{2000}b`. Note: such sub-patterns may be considerably
      cheaper if they occur at the beginning or end of a pattern, and
      especially if the :c:member:`HS_FLAG_SINGLEMATCH` flag is set for that
      pattern.

  * Lazy modifiers (:regexp:`?` appended to another quantifier, e.g.
    :regexp:`\\w+?`) are supported but ignored (as Hyperscan reports all
    matches).

* Parenthesization, including the named and unnamed capturing and
  non-capturing forms. However, capturing is ignored.

* Alternation with the :regexp:`|` symbol, as in :regexp:`foo|bar`.

* The anchors :regexp:`^`, :regexp:`$`, :regexp:`\\A`, :regexp:`\\Z` and
  :regexp:`\\z`.

* Option modifiers: These allow behaviour to be switched on (with :regexp:`(?