References for this article: Data Structure (C Language Edition) by Li Yunqing et al., Introduction to Algorithms introduction: In text editing, we often need to find a specific character or pattern at a specific position in a text. This leads to the problem of string matching. This article starts with a simple string matching algorithm, then goes through the Rabin-Karp algorithm and finally the KMP algorithm, teaching you to thoroughly understand the KMP algorithm from beginning to end. Let's look at the definition of this string problem in the book Introduction to Algorithms: Assume that text is an array T[1...n] of length n, and pattern is an array P[1....m] of length m<=n. Further assume that the elements of P and T are characters belonging to a finite alphabet Σ. Based on the above figure, let's explain the string matching problem. The goal is to find all occurrences of the pattern P=abaa in the text T=abcabaabcaabac. This pattern occurs only once in the text, at displacement s = 3. Displacement s = 3 is a valid displacement. 1. Simple string matching algorithm A simple string matching algorithm uses a loop to find all valid displacements. The loop checks the condition P[1....m]=T[s+1....s+m] for each of the n-m+1 possible values of s. NAIVE-STRING-MATCHER(T, P) 1 n ← length[T] 2 m ← length[P] 3 for s ← 0 to n - m 4 do if P[1 ‥ m] = T[s + 1 ‥ s + m] //For each value of n-m+1 possible displacement s, the loop comparing the corresponding characters must be executed m times. 5 then print "Pattern occurs with shift" s Simple string matching algorithm, the figure above is for the text T=acaabc and the pattern P=aab. In the fourth line of code above, for each value of the n-m+1 possible displacement s, the loop comparing the corresponding characters must be executed m times. So, in the worst case, the running time of this simple pattern matching algorithm is O((n-m+1)m). -------------------------------- Let me give you another specific example and a specific running program: For the target string is banananobano, and the pattern string to be matched is nano, The following is the matching process. The principle is very simple. Just compare it with the first character of the target string. If they are the same, compare the next one. If they are different, move the pattern right. Then, each character of the pattern is compared. The operation process of this algorithm is shown in the figure below. //Index represents every n matching situations.
The complexity of the above algorithm is O(pattern_length*target_length), Where do we mainly waste our time? Looking at the step index = 2, we have matched 3 characters, but the 4th character is not matched. At this time, the character sequence we have matched is nan. If we move one position to the right, the character sequence that nan*** matches will be an, which is definitely not a match. Then shift right one position again, and the matched value is nan***. The matched sequence is n, which can be matched. If we know the information of the pattern itself in advance, we don't need to roll back the target_index every time a match fails. This kind of fallback wastes a lot of unnecessary time. If these properties of the pattern itself can be calculated in advance, Then, when there is a mismatch, the pattern can be moved directly to the next possible position. Omit the process that is impossible to match. As shown in the table above, we have a mismatch at index=2. At this time, we can directly move the pattern to the state of index=4. The kmp algorithm starts from this point. 2. KMP Algorithm 1. Overlay function (overlay_function) The covering function represents the properties of the pattern itself, and can be represented by the degree of self-coverage of all consecutive substrings of the pattern starting from the left. For example, the following string, abaabcaba Since the count starts at 0, a coverage function value of 0 indicates that there is 1 match. It is a matter of preference whether to start counting from 0 or from 0. Please adjust the value by yourself. -1 means no coverage. So what is coverage? Let's take a look at the definition in a more mathematical way. For example, for a sequence a0a1...aj-1 aj To find a k that satisfies a0a1...ak-1ak=aj-kaj-k+1...aj-1aj There is no larger k that satisfies this condition, that is, to find the largest k possible, so that the first k characters of the pattern match the last k characters, k should be as large as possible. The reason is that if there is a relatively large k, and we choose a smaller k that satisfies the conditions, Then when there is a mismatch, we will make the position where the pattern moves to the right larger, and there will be fewer matches at the positions where the pattern moves, so we will lose the possible matching results. For example, the following sequence, There is a mismatch in the red part. The correct result is when k=1, so shift the pattern right by 4 bits. If k=0 is selected, shifting right by 5 bits will result in an error. The method of calculating this overlay function can be recursive. It can be imagined that if for the first j characters of the pattern, if the overlay function value is k a0a1...ak-1ak=aj-kaj-k+1...aj-1aj Then for the first j+1 sequence characters of pattern, the following possibilities are possible: (1) pattern[k+1]==pattern[j+1] then overlay(j+1)=k+1=overlay(j)+1 ⑵ pattern[k+1]≠pattern[j+1] At this time, the corresponding overlay function can only be found in the substring of the first k+1 sub-symbols of pattern, h=overlay(k). If pattern[h+1]==pattern[j+1], then overlay(j+1)=h+1. Otherwise, repeat the process (2). Here is a code to calculate the coverage function:
The running results are: -1 -1 0 0 1 -1 0 1 2 Press any key to continue ---------------------------------------- 2. KMP algorithm With the coverage function, it is very simple to implement the kmp algorithm. Our principle is still to match from left to right, but when a mismatch occurs, we do not need to move the target_index back. The part that has been matched before the target_index can be reflected in the pattern itself. We only need to move the pattern_index. When j length mismatch occurs, just move the pattern to the right by j-overlay(j) length. If pattern_index==0 when there is a mismatch, it means that the first character of pattern does not match. At this time, you should add 1 to target_index and move it 1 bit to the right. OK, the following figure shows the process of the KMP algorithm (the red part is the execution process of the KMP algorithm): Ok, *** gives the C++ code for the KMP algorithm implementation:
3. The source of the kmp algorithm KMP is so ingenious, so how did it come about? Why did it take three people to come up with it? In fact, even if there is no KMP algorithm, people can find the same efficient algorithm in character matching. This algorithm is ultimately equivalent to the KMP algorithm, but the starting point of this algorithm is not the covering function, and it is not directly based on the internal principle of matching. The covering function calculated using this method is complex and difficult to understand. However, once this covering function is found, the efficiency of matching the same pattern in the future will be the same as KMP. In fact, the function found by this algorithm should not be called a covering function, because the issue of whether it is covered is not considered at all during the search process. After talking for so long, what is this method? This method is the famous Deterministic Finite State Automaton (DFA). The grammar that DFA can recognize is Type 3 grammar, also called regular grammar or regular grammar. Since it can recognize regular grammar, then recognizing a certain string is definitely not a problem (a certain string is a subset of a regular expression). There is a complete algorithm for how to construct DFA, which will not be introduced here. Using DFA to recognize a certain string is really a waste of talent. DFA can recognize more general regular expressions, and using the general method of constructing DFA to recognize a certain string, the overhead is too large. The value of the kmp algorithm lies in that it starts from the characteristics of the character matching problem itself, and cleverly uses the concept of covering function, which represents the characteristics of the pattern itself, to quickly and directly generate a DFA for recognizing strings. Therefore, for an algorithm like kmp, high school mathematics is enough to understand it, but if you want to design such an algorithm from scratch, you require a relatively deep mathematical foundation. Original text: http://www.2cto.com/kf/201104/87381.html Author's statement: July personally owns the copyright to this series of 24 classic algorithms. Please indicate the source when reprinting. |
<<: What impact will Android 5.0 bring to Android Wear?
>>: The changes are huge? What are the changes in Android 5.0?
When 2b products are being sold and acquiring cus...
The developers of the first Apple Watch game are ...
As a social method for the younger generation, &q...
The smart home industry was a little restless in ...
This article sorts out the five key indicators an...
Course Description: Brother Yong’s project is a f...
Will housing prices in Guangzhou fall in 2020? An...
The hot August gathers the enthusiasm of midsumme...
A few days ago, I shared an article - "Is yo...
In the current situation where paid content is ho...
What content is most likely to dominate TikTok ? ...
As a high-end mobile phone brand, Apple has alway...
In the blink of an eye, today is already the thir...
Many friends asked, what should I do if I have tr...
According to a report from Aurora, the total onli...