In string (substring) pattern matching, the most classic and simplest algorithm is the BP algorithm (Brute Force).

BP algorithm

First, some conventions used in the explanation below:

  • The main string is S and the substring (pattern) is T, with S.length > T.length
  • Positions in a string are counted from 0

The basic idea: starting from the first character of the main string, compare it with the first character of the substring, and keep comparing the characters at corresponding positions. There are two outcomes:

  • Every character of the substring matches from the starting position -> the match succeeds
  • The match fails at position i of the main string (the substring position is j at this time) -> the main string restarts one position after the previous starting position and is compared with the first character of the substring again

Note: when a match fails, the main string pointer returns to where the current attempt started and advances by one position (i = i - j + 1).

As shown in the figure below:

Note: The green area indicates a successful match (the length is j) and the red area indicates a failed match (at the position of i).

Note: the figure shows the result of a successful match.

The code of this algorithm is:

    int index(String s, String t) {
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            if (s.charAt(i) == t.charAt(j)) { // the current characters match
                j++;
                i++;
            } else { // mismatch: i returns to the start of this attempt plus 1, and matching restarts
                i = i - j + 1;
                j = 0;
            }
        }
        // After the loop, check whether the match succeeded
        if (j >= t.length()) { // j has walked the entire substring t, so the match succeeded
            return i - j; // position of the first successful match
        } else {
            return -1;
        }
    }

Thinking: this algorithm matches the way we would naturally approach the problem, so what is wrong with it?

The problem is that it ignores a property of the strings themselves: some strings have "symmetry (similarity)" within them, and we should exploit this property (the internal information of the string).

  • The "symmetry (similarity)" here is not symmetry in the usual sense (in the usual sense, an image is symmetric about the X axis if folding it along the X axis makes the two sides coincide). As shown in the figure below:

 [Figure: a string whose red area and green area are similar to each other]

Here symmetry (similarity) means that, as in the figure above, the red area and the green area are "symmetrical (similar)" and differ only by a positional offset.

  • An example: in the substring abab, the similar areas are ab

A few notes on the symmetry of substrings:

  1. Most strings have some degree of similarity. Even when there is none, the BP algorithm can still be optimized (the lack of similarity in the substring lets us skip unnecessary checks; this is explained in the KMP part)
  2. The two similar areas of a substring cannot coincide completely; they must be offset by some distance. For example, in the substring aaaa the similar areas are aaa (not aaaa, because those two copies would coincide completely).
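The definition can be tried out with a small brute-force helper. This is only a sketch for illustration: the class name SimilarRegion and the method similarLength are made up here, and the method simply tries every candidate length from longest to shortest rather than using anything from KMP.

```java
class SimilarRegion {
    // Length of the similar area of str: the longest prefix that is also a
    // suffix, excluding str itself (the two copies must be offset).
    static int similarLength(String str) {
        for (int len = str.length() - 1; len > 0; len--) {
            // Compare the first len characters with the last len characters.
            if (str.regionMatches(0, str, str.length() - len, len)) {
                return len;
            }
        }
        return 0; // no similar area
    }

    public static void main(String[] args) {
        System.out.println(similarLength("abab")); // 2 (similar areas "ab")
        System.out.println(similarLength("aaaa")); // 3 (similar areas "aaa")
        System.out.println(similarLength("abcd")); // 0
    }
}
```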

The core of the KMP algorithm is:

  1. For each character of the pattern substring, find the length of the similar area of the "small string" in front of it (the part of the pattern before that character).
  2. Use this similar area to skip unnecessary comparisons.

The KMP principle is explained in detail below.

KMP algorithm

Two properties let the KMP algorithm optimize the search:

  1. At the point where the substring fails to match, the "small string" in front of that point is identical to the corresponding part of the main string above it
  2. The "small string" has "similarity", that is, it contains similar areas

Note: here and below, "small string" means the part of the substring that lies before the position where matching failed

As shown in the figure below:

Note: This figure corresponds to the figure of BP algorithm. The green is the area that has been successfully matched, and the red is the breakpoint where the matching failed.


Suppose the "small string" before t[j] in the substring (i.e. the green area that matched successfully) has similar areas:

Zone 1 = Zone 2. As shown in the following figure, the substring contains two similar areas:

Under the BP algorithm, the main string would go back to position i-j+1 and start matching against the first character of the substring, that is, it would test whether Zone 3 = Zone 1 (Zone 3 being the part of the main string that lines up with Zone 2).

Here Zone 2 = Zone 3 (this is the green area, which has already matched) and Zone 1 = Zone 2 (the assumed similar areas), so necessarily Zone 1 = Zone 3.

So the comparison between Zone 3 and Zone 1 can be skipped: the i pointer of the main string does not need to return to i-j+1, and s[i] is compared directly with t[j-1].

This is the simplest case: the length of the similar area is the length of the "small string" minus 1 (j-1).


In practice the similar area may not be that long (the first case is the best case).

If Zone 1 ≠ Zone 2, then by the same deduction Zone 3 ≠ Zone 1.

So the BP algorithm's comparison of Zone 3 with Zone 1 is unnecessary (it is bound to fail), and BP would move on to compare s[i-j+2] with t[0], as shown in the following figure:

Suppose the similar area of the small string shrinks to the current Zone 11 and Zone 22 (of length j-2).

Because Zone 11 == Zone 22 and Zone 22 == Zone 33, it follows that Zone 11 == Zone 33.

So the i pointer of the main string still does not need to go back, and s[i] can be compared directly with t[j-2].


Continuing in the same way:

Suppose the similar area of the small string shrinks to length j-3; then the i pointer of the main string still does not need to go back, and s[i] can be compared directly with t[j-3].

……

Suppose the small string has no similar area at all; then the i pointer of the main string still does not need to go back, and s[i] can be compared directly with t[0].

Compared with the BP algorithm, this still saves a whole series of comparisons (such as s[i-j+1] ?= t[0], s[i-j+2] ?= t[0], and so on).
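The zone argument can be checked on a concrete pair of strings. The sketch below is only an illustration: the two strings, the class name ZoneCheck and both helper methods are made up for this example, and k = 2 is the similar-area length of the small string abab, worked out by hand.

```java
class ZoneCheck {
    // Position of the first mismatch when t is compared with s starting at s[0].
    // Because the attempt starts at 0, the main string index i equals j here.
    static int firstMismatch(String s, String t) {
        int j = 0;
        while (j < s.length() && j < t.length() && s.charAt(j) == t.charAt(j)) {
            j++;
        }
        return j;
    }

    // True if the first k characters of t (Zone 1) equal the k characters of s
    // that end just before position i (Zone 3).
    static boolean zonesAgree(String s, String t, int i, int k) {
        return t.substring(0, k).equals(s.substring(i - k, i));
    }

    public static void main(String[] args) {
        String s = "abababc";
        String t = "ababc";
        int j = firstMismatch(s, t); // 4: s[4] is 'a' but t[4] is 'c'
        int i = j;
        int k = 2; // the similar areas of the small string "abab" are "ab"
        System.out.println(zonesAgree(s, t, i, k));     // true: Zone 1 = Zone 3
        System.out.println(s.charAt(i) == t.charAt(k)); // true: matching resumes at t[2]
    }
}
```

Since Zone 1 and Zone 3 already agree, the only comparison left to make is s[i] against t[k], which is exactly the step KMP takes.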


From the analysis above, KMP differs from the BP algorithm in two key respects:

  1. The i pointer of the main string never needs to go back
  2. s[i] is compared directly with t[k], where k is the length of the similar area of the small string. In other words, when matching fails at position j of the substring, the substring pointer j moves to position k in the next step (j = k), and matching resumes by comparing s[i] with t[k].

The key problem is how to find the k that corresponds to each position j of the pattern substring.

That is, for each character (position j) of the pattern substring, the small string in front of it has a similar area of some length k. We record this length as next[j] = k.

The actual KMP algorithm computes the whole next array of the pattern substring up front and then performs the matching, so it has two steps:

  1. Compute the next array of the pattern substring
  2. When a match fails, the main string pointer i does not go back, and the substring pointer j moves to position next[j] to continue matching, until the substring is fully matched or the main string is exhausted (i >= s.length)

Based on the analysis above, the next array of the pattern substring is easy to understand:

  1. (the first two values are fixed) next[0] = -1; next[1] = 0;
  2. Then, starting from j = 2, compute the length of the similar area of the small string in front of each position.
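These two rules can be followed literally in code. The sketch below mirrors the hand calculation rather than the compact textbook loop: it recomputes the similar area of every small string from scratch, so it is slow but easy to check. The class name NextByDefinition and both method names are invented for this example.

```java
import java.util.Arrays;

class NextByDefinition {
    // Longest prefix of str that is also a suffix, excluding str itself.
    static int similarLength(String str) {
        for (int len = str.length() - 1; len > 0; len--) {
            if (str.regionMatches(0, str, str.length() - len, len)) {
                return len;
            }
        }
        return 0;
    }

    // next[0] is fixed at -1; for j >= 1, next[j] is the length of the
    // similar area of the small string t[0..j-1] (which gives next[1] = 0).
    static int[] nextArray(String t) {
        int[] next = new int[t.length()];
        for (int j = 0; j < t.length(); j++) {
            next[j] = (j == 0) ? -1 : similarLength(t.substring(0, j));
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(nextArray("abab")));  // [-1, 0, 0, 1]
        System.out.println(Arrays.toString(nextArray("aaaab"))); // [-1, 0, 1, 2, 3]
    }
}
```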

Note: next[0] = -1 means that when s[i] fails to match position 0 of the substring, the next step is to move the main string pointer i forward by one position and compare again with position 0 of the substring. In code this is written as:

    if (j == -1) { // j = next[j] was done earlier; see the complete code below
        i++; // the main string pointer i moves forward one position
        j++; // j becomes 0, so matching restarts at t[0]
    }

The code of KMP algorithm is given below:

    int KMPindex(String s, String t) {
        if (t.isEmpty()) return 0; // nothing to match
        // 1. Compute the next array of the substring
        int[] next = new int[t.length()];
        next[0] = -1; // fixed; next[1] = 0 is also fixed and is produced by the first pass of the loop
        int j = 0, k = -1;
        while (j < t.length() - 1) { // length of the similar area of the small string before each position
            // Computing next by hand is easy, but writing the code is not. The version in
            // Li Chunbao's "Data Structure" is very neat and needs only 4 lines.
            // Tracing the textbook code shows it gives the right results, though it is hard to come up with.
            // This code is not needed for the postgraduate entrance exam, so it is not analysed
            // further here; working the array out by hand is enough.
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        // 2. Start matching. The matching code is basically the same as the BP algorithm;
        //    only the handling of a mismatch differs
        int i = 0;
        j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1 || s.charAt(i) == t.charAt(j)) { // match, or j == -1 after a failed match (next[j] was -1)
                j++;
                i++;
            } else { // mismatch: i does not move, j moves to next[j] and is compared with s[i] again
                j = next[j];
            }
        }
        // After the loop, check whether the match succeeded
        if (j >= t.length()) { // j has walked the entire substring t, so the match succeeded
            return i - j; // position of the first successful match
        } else {
            return -1;
        }
    }

Note: step 2 above follows the textbook, which folds two cases (a successful match, and a failed match with next[j] == -1) into the same i++; j++ branch. That is compact but not easy to follow. It can be rewritten as below, which is easier to understand; the efficiency is the same and so is the essence.

    // 2. Start matching. The matching code is basically the same as the BP algorithm;
    //    only the handling of a mismatch differs
    int i = 0, j = 0;
    while (i < s.length() && j < t.length()) {
        if (s.charAt(i) == t.charAt(j)) { // the current characters match
            j++;
            i++;
        } else { // mismatch: i does not go back
            if (next[j] != -1) { // j moves to next[j] and is compared with s[i] again
                j = next[j];
            } else { // next[j] == -1 means the mismatch was at position 0 of the substring,
                     // so i moves forward one position and j stays at 0
                j = 0; // this statement can be removed, because j must already be 0 here
                i++;
            }
        }
    }
    // After the loop, check whether the match succeeded
    if (j >= t.length()) { // j has walked the entire substring t, so the match succeeded
        return i - j; // position of the first successful match
    } else {
        return -1;
    }
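As a rough check of how many comparisons KMP saves over BP, both searches can be instrumented with a counter. This is a sketch under assumptions: the class CompareCount, its method names and the test strings are invented for the illustration, and each method returns a two-element array {match position, number of character comparisons}.

```java
class CompareCount {
    // next array of the pattern, computed with the compact textbook loop.
    static int[] buildNext(String t) {
        int[] next = new int[Math.max(t.length(), 1)];
        next[0] = -1;
        int j = 0, k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // BP search; returns {match position or -1, comparisons made}.
    static int[] bp(String s, String t) {
        int i = 0, j = 0, comparisons = 0;
        while (i < s.length() && j < t.length()) {
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                i = i - j + 1; // go back to the start of this attempt plus 1
                j = 0;
            }
        }
        return new int[] { j >= t.length() ? i - j : -1, comparisons };
    }

    // KMP search; returns {match position or -1, comparisons made}.
    static int[] kmp(String s, String t) {
        int[] next = buildNext(t);
        int i = 0, j = 0, comparisons = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) { // the mismatch was at t[0]: advance i, restart at t[0]
                i++;
                j = 0;
                continue;
            }
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                j = next[j]; // i never goes back
            }
        }
        return new int[] { j >= t.length() ? i - j : -1, comparisons };
    }

    public static void main(String[] args) {
        int[] a = bp("aaaaaaab", "aaab");
        int[] b = kmp("aaaaaaab", "aaab");
        System.out.println("BP:  index " + a[0] + ", comparisons " + a[1]); // index 4, comparisons 20
        System.out.println("KMP: index " + b[0] + ", comparisons " + b[1]); // index 4, comparisons 12
    }
}
```

Both searches find the match at position 4; the difference is that BP re-reads main string characters after every failed attempt, while KMP only moves j.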

The weakness of the KMP algorithm is that it overlooks a special kind of pattern substring, such as aaaaa. Such a substring has equal characters at particular positions, and this property can be used to simplify matching further.

Let us return to the KMP algorithm. When the substring fails to match at position j (let break = j, a temporary variable that records where the match failed), we set j = next[j] and then test whether s[i] equals t[j].

Previously we used "similarity + the small string equals the corresponding part of the main string". Here we can also exploit "equal characters at particular positions + the character at position break (j before the move) differs from the character at position i of the main string".

Because t[break] ≠ s[i], if t[j] = t[break] (where j is the value after the move), then t[j] cannot equal s[i] either. So the comparison of s[i] with t[j] can be skipped: move j to next[j] once more (j = next[j]) and then compare s[i] with t[j].

Summary: if the substring fails to match at position j and t[next[j]] = t[j], then s[i] can be compared directly with t[next[next[j]]].

As shown in the figure above, s[i] can be compared directly with t[next[next[j]]].

The reason why the KMP algorithm needs to be improved is mentioned above. The specific improvement methods are shown below.

Improved KMP algorithm

According to the analysis above, when computing the next array we only need to additionally check whether t[j] and t[next[j]] are equal:

  • If they are equal, next[j] = next[next[j]];
  • If they are not equal, next[j] is computed as before (the length of the similar area of the small string)

To distinguish from the previous algorithm, we rename the next array to the nextval array.

In fact, computing the nextval array includes computing the next array, so when working by hand we first compute the next array and then derive the nextval array from it, as follows:

Compare t[j] and t[next[j]] for equality:

  • If they are equal, nextval[j] = nextval[next[j]];

Note: next[j] is always less than j, so nextval[next[j]] has already been computed. This is because the similar area of the small string must be shorter than the small string itself (the two copies have to be offset by some distance).

  • If they are not equal, nextval[j] = next[j]

This procedure differs from the one described earlier:

The first method uses only the nextval array. The length of the similar area of the small string is still computed along the way, but it is never stored separately; this form suits the algorithm code.

The second method explicitly computes the next array first and then the nextval array, which is convenient when working by hand.
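The second, two-step method translates almost word for word into code. The following is a sketch only: the class NextvalFromNext and its method names are invented here, and nextval[0] = -1 is taken as fixed in the same way as next[0].

```java
import java.util.Arrays;

class NextvalFromNext {
    // Step 1: the ordinary next array (compact textbook loop).
    static int[] buildNext(String t) {
        int[] next = new int[Math.max(t.length(), 1)];
        next[0] = -1;
        int j = 0, k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Step 2: derive nextval from next, from left to right.
    static int[] buildNextval(String t) {
        int[] next = buildNext(t);
        int[] nextval = new int[t.length()];
        for (int j = 0; j < t.length(); j++) {
            if (j == 0) {
                nextval[j] = -1; // fixed value
            } else if (t.charAt(j) == t.charAt(next[j])) {
                // next[j] < j, so nextval[next[j]] is already known
                nextval[j] = nextval[next[j]];
            } else {
                nextval[j] = next[j];
            }
        }
        return nextval;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildNext("aaaab")));    // [-1, 0, 1, 2, 3]
        System.out.println(Arrays.toString(buildNextval("aaaab"))); // [-1, -1, -1, -1, 3]
        System.out.println(Arrays.toString(buildNextval("abab")));  // [-1, 0, -1, 0]
    }
}
```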

Code of KMP improved algorithm:

    int KMPindex(String s, String t) {
        if (t.isEmpty()) return 0; // nothing to match
        // 1. Compute the nextval array of the substring
        int[] nextval = new int[t.length()];
        nextval[0] = -1; // fixed value of the first position
        int j = 0, k = -1;
        while (j < t.length() - 1) {
            // As with KMP, only the hand-calculated result is required and this code was not
            // studied in depth; the loop body is the usual textbook form
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                // if t[j] == t[k], comparing s[i] with t[k] would fail too, so jump further
                nextval[j] = (t.charAt(j) == t.charAt(k)) ? nextval[k] : k;
            } else {
                k = nextval[k];
            }
        }
        // 2. Start matching. The matching code is basically the same as the BP algorithm;
        //    only the handling of a mismatch differs
        int i = 0;
        j = 0;
        while (i < s.length() && j < t.length()) {
            if (s.charAt(i) == t.charAt(j)) { // the current characters match
                j++;
                i++;
            } else { // mismatch: i does not go back
                if (nextval[j] != -1) { // j moves to nextval[j] and is compared with s[i] again
                    j = nextval[j];
                } else { // nextval[j] == -1 means j would end up at position 0 and t[j] == t[0],
                         // so s[i] cannot match t[0] either: i moves forward one position
                    j = 0; // this statement must NOT be removed: nextval[0] is not the only entry
                           // that can be -1, other positions may be -1 as well
                    i++;
                }
            }
        }
        // After the loop, check whether the match succeeded
        if (j >= t.length()) { // j has walked the entire substring t, so the match succeeded
            return i - j; // position of the first successful match
        } else {
            return -1;
        }
    }
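To see what the improvement buys, the same matching loop can be run once with the next table and once with the nextval table while counting comparisons. This is a sketch under assumptions: the class TableComparison, the method search and the test strings are made up for the illustration, and the two tables for the pattern aaaab are written in by hand following the rules above.

```java
class TableComparison {
    // KMP-style matching driven by a given jump table (next or nextval).
    // Returns {match position or -1, number of character comparisons}.
    static int[] search(String s, String t, int[] table) {
        int i = 0, j = 0, comparisons = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) { // the table says: give up on s[i], restart at t[0]
                i++;
                j = 0;
                continue;
            }
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) {
                i++;
                j++;
            } else {
                j = table[j];
            }
        }
        return new int[] { j >= t.length() ? i - j : -1, comparisons };
    }

    public static void main(String[] args) {
        String s = "aaabaaaab";
        String t = "aaaab";
        int[] next = {-1, 0, 1, 2, 3};       // hand-computed next array of "aaaab"
        int[] nextval = {-1, -1, -1, -1, 3}; // hand-computed nextval array of "aaaab"
        int[] a = search(s, t, next);
        int[] b = search(s, t, nextval);
        System.out.println("next:    index " + a[0] + ", comparisons " + a[1]); // index 4, comparisons 12
        System.out.println("nextval: index " + b[0] + ", comparisons " + b[1]); // index 4, comparisons 9
    }
}
```

With next, the mismatch of s[3] = 'b' against t[3] = 'a' is followed by three more comparisons against t[2], t[1] and t[0], all of which are 'a' and all of which fail; with nextval those three comparisons are skipped.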

Finally

Some remarks on the notes above:

  1. The code was typed by hand in an editor and has not been run. It may well contain errors and is only meant to show the idea of the algorithms
  2. The KMP algorithm (and its improved version) essentially deals with one question: when matching fails at position j of the substring, where should j move to? So the meaning of the next (nextval) array is: if matching fails at the current position, which position to move to
  3. While characters keep matching, the KMP algorithm and the BP algorithm behave exactly the same
  4. My understanding may contain mistakes; corrections are welcome
Last modification: January 30, 2023