Exploring the Principles of Automata and Optimization Practice

1. Overview of Automata

Programmers who work in a Linux development environment have almost certainly used system tools such as sed, grep, and lex. sed and grep are important data-stream search and processing tools in Linux, while lex is a widely used lexical analyzer generator employed in complex language parsing and compiler front-end development. Although these tools serve different purposes, they all implement an automaton internally to perform regular-expression-based text searches over their input. An automaton is an equivalent implementation of a regular expression.

From the perspective of computation theory, regular expressions and automata are strictly equivalent: they have the same power to define matching patterns. A regular expression is the formal notation for a matching pattern, while an automaton is the machine representation of that pattern as implemented on a computer.

In the field of security detection and protection, intrusion detection systems (IDS), intrusion prevention systems (IPS), and web application firewalls (WAF) make extensive use of automaton technology to perform regular expression matching on network data streams, enabling the detection and analysis of network packets.

IPS/IDS and WAF systems

Automaton technology is also widely used in deep packet inspection (DPI) systems to parse and identify network packets.

DPI system for application identification

2. Regular expressions and automata

2.1 Introduction to Regular Expressions and Automata

In formal language and automata theory, regular expressions and finite automata are strictly equivalent.

Equivalence of regular expressions and automata

Automata are divided into deterministic finite automata (DFA) and non-deterministic finite automata (NFA). In a DFA, for a given state and a given input, the state transition is unique, and there is only one active state at any time. In contrast, in an NFA, a given state and input may have multiple possible state transitions, and multiple states may be active at the same time. NFAs are further divided into epsilon-NFAs, which contain epsilon transitions, and NFAs without epsilon transitions; their classic representatives are the Thompson NFA and the Glushkov NFA, respectively.
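To make the contrast concrete, here is a minimal Python sketch of how a DFA step differs from an NFA step. The state names and transition tables are hypothetical and not taken from any figure in this article:

```python
# DFA: one active state, at most one successor per (state, symbol) pair.
dfa_trans = {("q0", "A"): "q1", ("q1", "B"): "q0", ("q1", "F"): "q2"}

def dfa_step(state, symbol):
    # Deterministic: a unique next state (or None if no transition exists).
    return dfa_trans.get((state, symbol))

# NFA: a set of active states; one (state, symbol) pair may have several successors.
nfa_trans = {("q0", "A"): {"q1", "q3"}, ("q1", "B"): {"q0"}, ("q3", "F"): {"q4"}}

def nfa_step(states, symbol):
    # Non-deterministic: all possible successors remain active simultaneously.
    nxt = set()
    for s in states:
        nxt |= nfa_trans.get((s, symbol), set())
    return nxt

print(dfa_step("q0", "A"))    # q1 (single active state)
print(nfa_step({"q0"}, "A"))  # {'q1', 'q3'} (multiple active states)
```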

Thompson NFA

The figure above shows the Thompson NFA that recognizes the regular expression (AB|CD)*AFF*. As can be seen, a Thompson NFA is composed of NFA units for the basic sub-expressions, connected by epsilon edges. Two states connected by an epsilon edge can transition without consuming any input character, that is, they have an unconditional state transfer relationship. Two states connected by an edge labeled with a character can only transfer the active state when that character is read from the input.

Glushkov NFA

The figure above shows the Glushkov NFA that recognizes the regular expression (AB|CD)*AFF*. The Glushkov NFA is clearly much simpler than the Thompson NFA: the number of its states equals the total number of characters and character classes that appear in the regular expression. Compared with the Thompson NFA, the Glushkov NFA has fewer states and a more compact structure. In addition, in a Glushkov NFA the transition conditions between states are moved onto the nodes and become the activation conditions of the nodes, so runtime processing also becomes simpler. When a character c is read at runtime, the set of states reach(c) that can be activated by c is known. It is then only necessary to compute the successor set succ(s) of the current active state set s and intersect it with reach(c); the resulting intersection is the next active state set of the Glushkov NFA. For a Thompson NFA, by contrast, the activation conditions of the nodes are not unique, and the transitions along epsilon edges must also be processed, so computing the active state set at the next step is more complicated.
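The runtime step described above fits in a few lines of Python. The reach and succ tables below are hand-derived for (AB|CD)*AFF* with positions numbered 1 through 7 in order of appearance (A1 B2 C3 D4 A5 F6 F7); treat them as an illustration rather than the output of any particular tool:

```python
# States reachable by each input character (Glushkov position sets).
reach = {"A": {1, 5}, "B": {2}, "C": {3}, "D": {4}, "F": {6, 7}}
# Successor sets; state 0 is the initial state, states 6 and 7 are accepting.
succ = {0: {1, 3, 5}, 1: {2}, 2: {1, 3, 5}, 3: {4},
        4: {1, 3, 5}, 5: {6}, 6: {7}, 7: {7}}

def glushkov_step(active, c):
    # Successors of all currently active states...
    successors = set()
    for s in active:
        successors |= succ.get(s, set())
    # ...intersected with the states that character c can activate.
    return successors & reach.get(c, set())

active = {0}
for ch in "ABAF":
    active = glushkov_step(active, ch)
print(active)  # {6}: an accepting state, so "ABAF" matches (AB|CD)*AFF*
```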

Using the subset construction algorithm, a Thompson NFA or a Glushkov NFA can be converted into a DFA. The biggest advantage of a DFA over an NFA is performance; its disadvantage is space overhead. This is because the determinism of DFA state transitions is obtained by combining different NFA states into single DFA states. Therefore, for a functionally equivalent DFA and NFA, the number of DFA states is, in the worst case, exponential in the number of NFA states.

DFA state diagram

The figure above shows the state diagram obtained by converting the Thompson NFA that recognizes the regular expression (AB|CD)*AFF* into a DFA using the subset construction algorithm. The numbers in the set inside each blue box correspond to the state numbers of the Thompson NFA, which shows that each DFA state corresponds to a subset of the NFA's state set.
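A minimal sketch of the subset construction itself, applied to a tiny assumed NFA (not the one in the figure, and with epsilon transitions omitted for brevity), might look as follows:

```python
from collections import deque

# nfa[state][symbol] -> set of successor states (hypothetical toy NFA for A+B).
nfa = {0: {"A": {0, 1}}, 1: {"B": {2}}, 2: {}}
alphabet = {"A", "B"}

def subset_construction(start, nfa, alphabet):
    start_set = frozenset({start})
    dfa = {}  # maps frozenset of NFA states -> {symbol: frozenset}
    work = deque([start_set])
    while work:
        states = work.popleft()
        if states in dfa:
            continue
        dfa[states] = {}
        for sym in alphabet:
            # Each DFA transition is the union of all NFA successors.
            nxt = frozenset(t for s in states
                            for t in nfa.get(s, {}).get(sym, set()))
            if nxt:
                dfa[states][sym] = nxt
                work.append(nxt)
    return dfa

for dfa_state, trans in subset_construction(0, nfa, alphabet).items():
    print(set(dfa_state), {k: set(v) for k, v in trans.items()})
```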

2.2 Mainstream open source automaton libraries

The mainstream open source automaton-related libraries currently in widespread use are PCRE, RE2, and Hyperscan:

  • PCRE supports the most complete and complex regular expression syntax, but it only supports block-mode compilation and matching, and only the compilation and matching of a single regular expression at a time; its performance is the worst of the three libraries. For scenarios that require large-scale parallel matching of regular rule sets, PCRE cannot meet the requirements.
  • Google's open source regular matching engine RE2 is a fast, safe, and thread-friendly automaton implemented in C++ based on the virtual machine approach. It supports fewer regular expression constructs than PCRE but more than Hyperscan. RE2 supports parallel matching of small regular rule sets, but it does not support constructs that can only be implemented with a backtracking algorithm, such as backreferences (illustrated in the sketch after this list).
  • Hyperscan is an open source high-performance hybrid regular expression automaton based on the analysis and decomposition of regular expression NFA/DFA graphs. Of the three libraries, Hyperscan supports the least regular expression syntax, but its performance is the strongest, and it supports parallel matching of large-scale regular rule sets.
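As a concrete illustration of the syntax gap, the following minimal sketch uses Python's re module (a backtracking, PCRE-style engine) to match a backreference pattern; automaton-based engines such as RE2 and Hyperscan reject this construct at compile time, since it cannot be expressed as a finite automaton:

```python
import re

# \1 refers back to whatever group 1 captured; matching it requires
# backtracking, so automaton-based engines (RE2, Hyperscan) do not support it.
pat = re.compile(r"(ab|cd)\1")

print(bool(pat.fullmatch("abab")))  # True: group 1 = "ab", then repeated
print(bool(pat.fullmatch("abcd")))  # False: "cd" != the captured "ab"
```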


3. Performance optimization practice of automata

The matching rate of regular expressions is an important performance bottleneck that restricts services such as IDS/IPS, WAF, and DPI, and improving the matching performance of regular expression automata is the key to improving these services. The following sections introduce several mainstream methods for optimizing automaton performance.

3.1 Performance optimization based on pre-filtering

Regular expression optimization strategy based on pre-filtering

The figure above shows a regular expression matching optimization strategy based on string-matcher pre-filtering. During regular expression compilation, this scheme extracts string literals from the regular expressions and builds a multi-string pre-matcher from them. For example, the string SEARCH is extracted from rule 0, and the string SUBSCRIBE is extracted from rule N. When matching the input corpus, the multi-string matcher is run first. If, during this pass, the string SEARCH is matched but the string SUBSCRIBE is not, then only the automaton built from regular expression rule 0 is used to perform the second-stage regular expression match. As can be seen, pre-filter-based regular expression matching is a two-stage matching process.
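A minimal sketch of the two-stage process might look as follows. The rules and extracted literals are hypothetical, and a production system would use a multi-pattern matcher such as Aho-Corasick rather than naive substring search:

```python
import re

# Each rule pairs a compiled automaton (stage 2) with its extracted literal (stage 1).
rules = {
    "rule0": (re.compile(r"SEARCH\s+\S+"), "SEARCH"),
    "ruleN": (re.compile(r"SUBSCRIBE\s+sip:\S+"), "SUBSCRIBE"),
}

def match(corpus: str):
    hits = []
    for name, (automaton, literal) in rules.items():
        # Stage 1: cheap literal pre-filter rules out most non-matching input.
        if literal not in corpus:
            continue
        # Stage 2: full regex match, run only for rules whose literal appeared.
        if automaton.search(corpus):
            hits.append(name)
    return hits

print(match("SEARCH /index.html HTTP/1.1"))  # ['rule0']
```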

Although the string-matcher-based pre-filtering scheme can filter out non-matching corpora in advance, it still has the following shortcomings:

(1) Strings in the regular expression are matched repeatedly: the pre-filtering string matching component matches a string once, and the automaton then matches the same string again;

(2) In the second stage of the pre-filtering scheme, strings are matched by the automaton, which makes it difficult to effectively use the CPU's SIMD instruction set to accelerate string matching in parallel;

(3) A poor choice of key strings can easily drag down the overall regular expression matching performance.

In view of these shortcomings of the string-matcher-based pre-filtering scheme, a newer and more effective matching scheme based on regular expression decomposition emerged.

3.2 Performance optimization based on regular expression decomposition

A matching scheme based on regular expression decomposition first decomposes the regular expression into several substrings and sub-regular-expressions. The decomposed substrings are built into a string matcher (a string matcher can make effective use of the CPU's SIMD instruction set for parallel acceleration, giving it an order-of-magnitude performance advantage over matching strings with an automaton), and each decomposed sub-regular-expression is built into a sub-automaton such as an NFA or DFA. When matching the input corpus, the scheme invokes the matchers in a fixed order, preferring the string matchers. Only when the current matcher succeeds is the next matcher invoked, and only when all matchers succeed in the given order has the decomposed regular expression as a whole truly matched.

Regular expression matching strategy based on rule splitting

The figure above shows an example of using the decomposed regular expression .*start[^x]comA+ to match the input string AstarZcomA. The regular expression is first decomposed into five parts: the automaton parts FA2, FA1, and FA0, and the string parts STR2 and STR1. The matching order of the sub-automata and substring matchers built from the decomposition is STR1 -> STR2 -> FA1 -> FA0 -> FA2. The decomposed sub-automata and substrings follow these priority principles:

  • String matching takes precedence over automaton matching.
  • An automaton match between two strings takes precedence over an automaton match elsewhere.
  • An automaton matching toward the end of the corpus takes precedence over an automaton matching toward the beginning of the corpus.

The first priority principle is easy to understand, because string matching has an order-of-magnitude performance advantage over automaton matching. For an automaton between two strings, both ends of the corpus region it must match are anchored, so its priority is higher than that of other automata (priority principle 2). For an automaton that matches toward the end of the corpus, its starting position is anchored and no backtracking is required, so its priority is higher than that of an automaton matching toward the beginning (priority principle 3). As can be seen, the matching order of the decomposed sub-automata and substring matchers follows one principle: the matching step with the smaller performance overhead runs earlier.

For the input corpus AstarZcomA, the string matcher is first used to match STR1, which succeeds; the string matcher is then called to match STR2, which fails, so FA1, FA0, and FA2 are never invoked. If the input string is AstartZcomA instead, STR1, STR2, FA1, FA0, and FA2 all match in sequence, and the overall match is reported as successful.
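The chain of checks for this example can be sketched as follows. The split into literals and sub-automata below is one plausible reading of the figure, not the exact decomposition used by any particular engine, and for brevity only the first occurrence of each literal is considered:

```python
import re

def match_decomposed(corpus: str) -> bool:
    # STR1: cheap literal search for "com".
    i = corpus.find("com")
    if i == -1:
        return False
    # STR2: literal "start", which must end exactly one character before "com".
    j = corpus.find("start")
    if j == -1 or j + len("start") + 1 != i:
        return False
    # FA1: [^x] in the gap between the two literals (anchored on both sides).
    if not re.fullmatch(r"[^x]", corpus[j + len("start"):i]):
        return False
    # FA0: A+ after "com", anchored at its start, matching toward the end.
    if not re.match(r"A+", corpus[i + len("com"):]):
        return False
    # FA2: .* before "start" matches unconditionally.
    return True

print(match_decomposed("AstarZcomA"))   # False: STR2 "start" never found
print(match_decomposed("AstartZcomA"))  # True: all five parts match in order
```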

4. Practical considerations for regular rule matching

In Internet development and applications, a large number of scenarios such as network attack detection and application traffic identification rely on regular engines to match regular expressions. Matching efficiency depends not only on the performance of the regular engine used, but also on the form of the regular expressions that are written. Understanding the implementation principles of the regular engine gives us a deeper view of the correlation between the form of a regular expression and the engine's efficiency, and better guides performance tuning. The following writing guidelines can help us match regular expressions more efficiently in development and application:

  1. Try to avoid regular expression syntax that requires backtracking, such as backreferences. Backtracking can increase the time complexity of regular matching exponentially in the worst case (see the sketch after this list).
  2. Try to avoid constructs such as (.*) and {min,max} in regular expressions. The uncertainty introduced by (.*) and the bounded repetition introduced by {min,max} are important performance bottlenecks for regular expression engines.
  3. Try to make regular expressions more specific, for example by including more literal characters or strings in the expression.
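To illustrate guideline 1, the following sketch times an equivalent pair of hypothetical patterns in Python's backtracking re module. The nested quantifier (a+)+ triggers catastrophic backtracking on a non-matching input, while the equivalent a+ does not:

```python
import re
import time

# Both patterns recognize the same language (one or more 'a' at end of string),
# but the nested quantifier forces the engine to try exponentially many splits.
fast = re.compile(r"a+$")
slow = re.compile(r"(a+)+$")

text = "a" * 22 + "b"  # cannot match: the engine must exhaust all alternatives

for pat in (fast, slow):
    t0 = time.perf_counter()
    pat.search(text)
    print(f"{pat.pattern:10s} {time.perf_counter() - t0:.4f}s")
```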
