Understanding of KMP algorithm

Blogger: Friend C
Published: October 10, 2018
Classification: data structure

For substring pattern matching on strings, the most classic and simplest algorithm is the BP algorithm (Brute-Force, more commonly abbreviated BF).

BP algorithm

First, some definitions for the explanation that follows:

The main string is S and the substring is T, with S.length > T.length
String positions are indexed from 0

The basic idea: starting from its first character, the main string is matched against the first character of the substring, and the two are compared position by position. There are two possible outcomes:

The main string matches the whole substring from the starting position -> the match succeeds
The match fails at position i of the main string (the substring position is j at this point) -> the main string restarts from its second position and is matched against the first character of the substring again

Note: when a match fails, i must go back to where the current attempt started in the main string and advance one position, that is, i = i - j + 1.

As shown in the figure below:

Note: The green area indicates a successful match (the length is j) and the red area indicates a failed match (at the position of i).

Note: this figure shows the result of a successful match.

The code of this algorithm is:

    int index(String s, String t) {
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            if (s.charAt(i) == t.charAt(j)) { // the current characters match
                i++;
                j++;
            } else { // mismatch: i returns to the start of this attempt plus 1
                i = i - j + 1;
                j = 0;
            }
        }
        // After the loop, check whether the match succeeded
        if (j >= t.length()) { // j has walked through all of t, so the match succeeded
            return i - j; // start position of the first successful match
        } else {
            return -1;
        }
    }

Question: this algorithm matches the way we naturally think, but what is wrong with it?

The problem is that it ignores a property of the strings themselves: some strings have an internal "symmetry (similarity)", and we should exploit this property (the internal information of the string).

The "symmetry (similarity)" here is not symmetry in the usual sense (where, for example, an image is symmetric about the X axis, so folding it along the X axis makes the two sides overlap). As shown in the figure below:

Here symmetry (similarity) means that, as in the figure above, the red area and the green area are "symmetric (similar)" and differ only by a position offset.

An example: for the substring abab, the similar area is ab.

A few notes on the symmetry of substrings:

Most strings have some similarity. Even if a string has no "similarity" at all, the BP algorithm can still be optimized (the dissimilarity of the substring lets us skip unnecessary checks; the KMP section explains this)
The two similar areas of a substring cannot coincide completely; they must be offset by some distance. For example, for the substring aaaa the similar area is aaa (not aaaa, because the two areas may not overlap completely).
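The "similar area" described above is the longest prefix of a string that is also a suffix of it, excluding the string itself. A minimal brute-force sketch of that definition follows; the class name `SimilarArea` and the method name `similarLength` are made up for illustration and do not appear in the original post.

```java
final class SimilarArea {
    // Length of the longest prefix of p that is also a suffix of p.
    // The whole string is excluded: the two areas must be offset by at least one position.
    static int similarLength(String p) {
        for (int len = p.length() - 1; len > 0; len--) {
            String suffix = p.substring(p.length() - len);
            if (p.startsWith(suffix)) {
                return len;
            }
        }
        return 0; // no similar area
    }

    public static void main(String[] args) {
        System.out.println(similarLength("abab")); // 2, the area "ab"
        System.out.println(similarLength("aaaa")); // 3, the area "aaa"
        System.out.println(similarLength("abcd")); // 0
    }
}
```

This checks candidate lengths from longest to shortest, so it is quadratic at worst; it is only meant to pin down the definition, not to be efficient.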

The core of the KMP algorithm is:

For each character of the pattern substring, find the length of the similar area of the "small string" in front of it.
Use this similar area to skip unnecessary comparisons.

The details of how KMP works are explained below.

KMP algorithm

The KMP algorithm can speed up the search because of two properties:

The "small string" in front of the position where the substring fails to match is identical to the corresponding part of the main string above it
The "small string" has "similarity", that is, it contains similar areas

Note: we call the part of the substring in front of the mismatch position the "small string"; the same term is used below

As shown in the figure below:

Note: This figure corresponds to the figure of BP algorithm. The green is the area that has been successfully matched, and the red is the breakpoint where the matching failed.

Assume that the "small string" in front of t[j] (the green, successfully matched area) has similar areas:

with Zone 1 = Zone 2. As shown in the following figure, the small string contains two similar areas:

Following the BP algorithm, the main string would restart at position i-j+1 and be matched against the first character of the substring, which amounts to testing Zone 3 =? Zone 1.

Here Zone 2 = Zone 3 (this is the green area, which has already matched) and Zone 1 = Zone 2 (the assumed similar areas), so Zone 1 = Zone 3 necessarily holds.

The comparison between Zone 3 and Zone 1 can therefore be skipped: the pointer i of the main string does not need to return to i - j + 1, and s[i] is compared directly with t[j-1].

This is the simplest case. The length of the similar area is the length of the "small string" minus 1 (j-1).
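The Zone argument can also be written with indices. This is a sketch under two assumptions that are not in the original post: $t[a\,..\,b]$ denotes the characters of t from position a to position b inclusive, and k is the length of the similar area. The mismatch happens when s[i] is compared with t[j].

```latex
\begin{align*}
t[0\,..\,k-1]   &= t[j-k\,..\,j-1] && \text{(Zone 1 = Zone 2: the similar areas)} \\
t[j-k\,..\,j-1] &= s[i-k\,..\,i-1] && \text{(Zone 2 = Zone 3: already matched)} \\
\Rightarrow\quad t[0\,..\,k-1] &= s[i-k\,..\,i-1] && \text{(Zone 1 = Zone 3)}
\end{align*}
```

The first k characters of t are therefore already known to match the main string just before position i, and the next comparison is s[i] against t[k]. In the case above k = j - 1, which gives s[i] against t[j-1].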

In practice the similar area may not be that long (the first case is the best one).

If Zone 1 != Zone 2, then by the deduction above Zone 3 != Zone 1.

So the BP algorithm's comparison of Zone 3 with Zone 1 is unnecessary (it is bound to fail), and we can go straight to comparing s[i-j+2] with t[0], as shown in the following figure:

Assume the similar area of the small string shrinks to the current Zone 11 and Zone 22 (of length j-2).

Because Zone 11 == Zone 22, we get Zone 11 == Zone 33 (Zone 22 == Zone 33 necessarily holds).

Then the pointer i of the main string still does not need to go back; compare s[i] directly with t[j-2].

Continuing in the same way:

Assume the similar area of the small string shrinks to length j - 3. Then the pointer i of the main string still does not need to go back; compare s[i] directly with t[j-3].

……

Assume the small string has no similar area at all. Then the pointer i of the main string still does not need to go back; compare s[i] directly with t[0].

Even in this worst case (the small string has no similarity), we still save a series of comparisons relative to the BP algorithm (such as s[i-j+1] ?= t[0], s[i-j+2] ?= t[0], and so on).

From the analysis above, KMP differs from the BP algorithm in two key ways:

The pointer i of the main string never needs to go back
s[i] is compared directly with t[k], where k is the length of the similar area of the small string. The meaning of k: when the match fails at position j of the substring, the pointer j moves to position k (j = k), and matching resumes by comparing s[i] with t[k].

The key problem is how to find the corresponding k for each position j of the pattern substring.

That is, for each character (position j) of the pattern substring, the small string in front of it has a similar area of some length k. We record this length as next[j] = k.

The actual KMP algorithm computes the whole next array of the pattern substring up front and only then starts matching, so it has two steps:

Compute the next array of the pattern substring
When a match fails, the pointer i of the main string does not go back; the pattern pointer j moves to next[j] and matching continues until the main string is exhausted (i >= s.length)

Given the analysis above, computing the next array of the pattern substring is easy to understand:

Fix the first two values: next[0] = -1; next[1] = 0;
Then, starting from j = 2, compute the length of the similar area of the small string in front of position j.
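The hand procedure above can be sketched directly in code. This is a definition-level sketch, not the efficient textbook loop: for each position j it takes the small string t[0..j-1] and searches for its longest similar area by brute force. The names `NextByDefinition` and `nextArray` are made up for illustration.

```java
final class NextByDefinition {
    // next[0] = -1 by convention; for j >= 1, next[j] is the length of the
    // similar area of the small string t[0..j-1] in front of position j.
    static int[] nextArray(String t) {
        int[] next = new int[t.length()];
        for (int j = 0; j < t.length(); j++) {
            if (j == 0) {
                next[j] = -1;
                continue;
            }
            String small = t.substring(0, j);
            int k = j - 1; // the longest candidate must be shorter than the small string
            while (k > 0 && !small.startsWith(small.substring(j - k))) {
                k--;
            }
            next[j] = k;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(nextArray("ababc"))); // [-1, 0, 0, 1, 2]
        System.out.println(java.util.Arrays.toString(nextArray("aaaab"))); // [-1, 0, 1, 2, 3]
    }
}
```

For j = 1 the small string has a single character, so the loop leaves k at 0, which agrees with the fixed value next[1] = 0.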

Note: next[0] = -1 means that when s[i] of the main string does not match position 0 of the substring, the next step is to move the pointer i of the main string forward one position and compare with position 0 of the substring again. In code:

    if (j == -1) { // j = next[j] was executed earlier; see the complete code below
        i++; // the pointer i of the main string moves forward one position
        j++; // j becomes 0, so the next comparison is against t[0]
    }

The code of KMP algorithm is given below:

    int KMPindex(String s, String t) {
        if (t.isEmpty()) return 0; // an empty pattern matches at position 0
        // 1. Build the next array of the pattern substring
        int[] next = new int[t.length()];
        next[0] = -1; // fixed value; next[1] = 0 is produced by the first loop iteration
        int k = -1;
        for (int j = 0; j < t.length() - 1;) { // for each position, find the similar-area length of the small string in front of it
            // Computing the next array by hand is easy, but doing it in code is not.
            // The code in Li Chunbao's "Data Structure" is very neat and uses only 4 lines.
            // Tracing the textbook code shows that it does give the right result, but I would never have thought of writing it this way.
            // The postgraduate entrance exam basically does not test this code, so I won't dig deeper for now; being able to compute it by hand is enough.
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        // 2. Start matching. The code is basically the same as the BP algorithm; only the handling of a mismatch differs
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1 || s.charAt(i) == t.charAt(j)) { // the characters match, or the mismatch was at position 0 (j became next[0] == -1)
                j++;
                i++;
            } else { // mismatch: i does not move, j moves to next[j] and is compared with s[i] again
                j = next[j];
            }
        }
        // After the loop, check whether the match succeeded
        if (j >= t.length()) { // j has walked through all of t, so the match succeeded
            return i - j; // start position of the first successful match
        } else {
            return -1;
        }
    }

Note: the second step above is written the textbook way. The two cases "match succeeded" and "match failed with next[j] == -1" are merged because both execute i++; j++, which is hard to follow. It can be rewritten as below, which is easier to understand; the running time is the same and so is the essence.

    // 2. Start matching. The code is basically the same as the BP algorithm; only the handling of a mismatch differs
    int i = 0, j = 0;
    while (i < s.length() && j < t.length()) {
        if (s.charAt(i) == t.charAt(j)) { // the current characters match
            j++;
            i++;
        } else { // mismatch: i does not move
            if (next[j] != -1) { // j moves to next[j] and is compared with s[i] again
                j = next[j];
            } else { // next[j] == -1 means the mismatch was at position 0 of the substring, so i advances one position and j stays 0
                j = 0; // this statement can be removed, because j must already be 0 here
                i++;
            }
        }
    }
    // After the loop, check whether the match succeeded
    if (j >= t.length()) { // j has walked through all of t, so the match succeeded
        return i - j; // start position of the first successful match
    } else {
        return -1;
    }
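To make the saving over the BP algorithm concrete, the sketch below counts character comparisons for both approaches on one input. The class `ComparisonCount`, its `comparisons` counter, and the sample strings are assumptions chosen for illustration; the search logic follows the BP and KMP steps described above, and the pattern is assumed to be non-empty.

```java
final class ComparisonCount {
    static int comparisons; // character comparisons made by the most recent search

    // BP search: on a mismatch, i goes back to the start of the attempt plus 1.
    static int bp(String s, String t) {
        comparisons = 0;
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) { i++; j++; }
            else { i = i - j + 1; j = 0; }
        }
        return j >= t.length() ? i - j : -1;
    }

    // KMP search: i never goes back; on a mismatch, j moves to next[j].
    // Assumes t is non-empty.
    static int kmp(String s, String t) {
        comparisons = 0;
        int[] next = new int[t.length()];
        next[0] = -1;
        for (int j = 0, k = -1; j < t.length() - 1;) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1) { i++; j++; continue; } // mismatch at position 0: advance i
            comparisons++;
            if (s.charAt(i) == t.charAt(j)) { i++; j++; }
            else { j = next[j]; }
        }
        return j >= t.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        String s = "aaaaaaab", t = "aaab";
        System.out.println(bp(s, t) + " " + comparisons);  // 4 20
        System.out.println(kmp(s, t) + " " + comparisons); // 4 12
    }
}
```

Both searches find the match at position 4. BP re-compares the same run of a's on every attempt, while KMP resumes from t[2] after each mismatch.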

The problem with the KMP algorithm is that it overlooks one special kind of pattern substring, such as aaaaa. Such a string can use its own property of equal character values to simplify the matching process further.

Back to the KMP algorithm: when the substring fails to match at position j (let break = j, a temporary variable that records the mismatch position), we set j = next[j] and then test whether s[i] equals t[j].

Previously we exploited "similarity + the small string equals the corresponding part of the main string". Here we can dig further and exploit "equal character values + the character at position break (j before the move) differs from s[i]".

Because t[break] != s[i], if t[j] == t[break] (where j is the moved j), then t[j] cannot equal s[i] either. So the comparison of s[i] with t[j] can be skipped: move j to next[j] once more (j = next[j]) and only then compare s[i] with t[j].

Summary: if the substring fails to match at position j and t[next[j]] == t[j], then s[i] can be compared directly with t[next[next[j]]].

As shown in the figure above, s [i] can be directly compared with t [next [j-i]].

The reason why the KMP algorithm needs to be improved is mentioned above. The specific improvement methods are shown below.

Improved KMP algorithm

According to the analysis above, when computing the next array we only need to check additionally whether t[j] and t[next[j]] are equal:

If they are equal, next[j] = next[next[j]];
If they are not equal, next[j] keeps the value from the previous method (the length of the similar area of the small string)

To distinguish from the previous algorithm, we rename the next array to the nextval array.

In fact, computing the nextval array includes computing the next array, so when working by hand we first compute the next array and then the nextval array. The process is as follows:

Compare t[j] and t[next[j]] for equality:

If they are equal, nextval[j] = nextval[next[j]];

Note: next[j] is always less than j, so nextval[next[j]] has already been computed. This holds because the similar area of the small string must be shorter than the small string itself (the two areas must be offset by some distance).

If they are not equal, nextval[j] = next[j]
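The two-pass hand method can be sketched as follows: first build next, then derive nextval from it with the rule above. The names `NextvalFromNext`, `next`, and `nextval` as method names are illustrative choices, and the pattern is assumed to be non-empty.

```java
final class NextvalFromNext {
    // Plain next array (next[0] = -1). Assumes t is non-empty.
    static int[] next(String t) {
        int[] next = new int[t.length()];
        next[0] = -1;
        for (int j = 0, k = -1; j < t.length() - 1;) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Second pass: nextval[j] = nextval[next[j]] if t[j] == t[next[j]], else next[j].
    static int[] nextval(String t) {
        int[] next = next(t);
        int[] nextval = new int[t.length()];
        nextval[0] = -1;
        for (int j = 1; j < t.length(); j++) {
            // next[j] >= 0 and next[j] < j here, so nextval[next[j]] is already filled in
            nextval[j] = (t.charAt(j) == t.charAt(next[j])) ? nextval[next[j]] : next[j];
        }
        return nextval;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(next("aaaab")));    // [-1, 0, 1, 2, 3]
        System.out.println(java.util.Arrays.toString(nextval("aaaab"))); // [-1, -1, -1, -1, 3]
        System.out.println(java.util.Arrays.toString(nextval("ababc"))); // [-1, 0, -1, 0, 2]
    }
}
```

For aaaab every a inherits -1 from position 0, so a mismatch on any of the leading a's advances the main string immediately instead of stepping back through the pattern one position at a time.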

This process differs from the one described earlier:

The first method uses a single array, nextval. The length of the similar area of the small string is still computed along the way, but it is not stored separately. This form suits the algorithm code

The second method explicitly computes the next array first and then the nextval array, which is convenient for manual calculation.

Code of KMP improved algorithm:

    int KMPindex(String s, String t) {
        if (t.isEmpty()) return 0; // an empty pattern matches at position 0
        // 1. Build the nextval array of the pattern substring
        int[] nextval = new int[t.length()];
        nextval[0] = -1; // fixed value of the first position
        int k = -1;
        for (int j = 0; j < t.length() - 1;) { // for each position, find the similar-area length of the small string in front of it
            // As with KMP, the exam only asks for the hand-computed result; this is the standard textbook loop
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                // equal characters inherit nextval[k]; otherwise use the similar-area length k
                nextval[j] = (t.charAt(j) == t.charAt(k)) ? nextval[k] : k;
            } else {
                k = nextval[k];
            }
        }
        // 2. Start matching. The code is basically the same as the BP algorithm; only the handling of a mismatch differs
        int i = 0, j = 0;
        while (i < s.length() && j < t.length()) {
            if (s.charAt(i) == t.charAt(j)) { // the current characters match
                j++;
                i++;
            } else { // mismatch: i does not move
                if (nextval[j] != -1) { // j moves to nextval[j] and is compared with s[i] again
                    j = nextval[j];
                } else { // nextval[j] == -1 means j would end up at position 0 with t[j] == t[0], so s[i] cannot match t[0] either
                    j = 0; // this statement must NOT be removed: nextval[0] is -1, but other positions may be -1 as well
                    i++;
                }
            }
        }
        // After the loop, check whether the match succeeded
        if (j >= t.length()) { // j has walked through all of t, so the match succeeded
            return i - j; // start position of the first successful match
        } else {
            return -1;
        }
    }
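Hand-typed search code is easy to get subtly wrong, so one way to gain confidence is to compare it with Java's built-in `String.indexOf` on every short string over a two-letter alphabet. The sketch below does that for a single-pass nextval search; the class `NextvalKmpCheck` and its helper names are assumptions for illustration, and the nextval loop is the standard textbook form rather than code from the original post.

```java
final class NextvalKmpCheck {
    // Improved KMP: builds nextval in one pass, then matches without moving i back.
    static int kmpIndex(String s, String t) {
        if (t.isEmpty()) return 0; // same convention as String.indexOf
        int[] nextval = new int[t.length()];
        nextval[0] = -1;
        int j = 0, k = -1;
        while (j < t.length() - 1) {
            if (k == -1 || t.charAt(j) == t.charAt(k)) {
                j++;
                k++;
                nextval[j] = (t.charAt(j) == t.charAt(k)) ? nextval[k] : k;
            } else {
                k = nextval[k];
            }
        }
        int i = 0;
        j = 0;
        while (i < s.length() && j < t.length()) {
            if (j == -1 || s.charAt(i) == t.charAt(j)) { i++; j++; }
            else { j = nextval[j]; }
        }
        return j >= t.length() ? i - j : -1;
    }

    // Decodes the low `len` bits of `bits` into a string over {a, b}.
    static String fromBits(int bits, int len) {
        StringBuilder sb = new StringBuilder();
        for (int p = 0; p < len; p++) {
            sb.append(((bits >> p) & 1) == 0 ? 'a' : 'b');
        }
        return sb.toString();
    }

    // True if kmpIndex agrees with String.indexOf for all s, t of length 0..maxLen.
    static boolean agreesWithIndexOf(int maxLen) {
        for (int sLen = 0; sLen <= maxLen; sLen++) {
            for (int sBits = 0; sBits < (1 << sLen); sBits++) {
                String s = fromBits(sBits, sLen);
                for (int tLen = 0; tLen <= maxLen; tLen++) {
                    for (int tBits = 0; tBits < (1 << tLen); tBits++) {
                        String t = fromBits(tBits, tLen);
                        if (kmpIndex(s, t) != s.indexOf(t)) return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreesWithIndexOf(6));
    }
}
```

A two-letter alphabet is a deliberate choice: it produces many repeated characters, which is exactly where nextval differs from next.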

Finally

A few notes on the personal understanding above:

The code was typed by hand in an editor and has not been run. It may well contain errors and is only meant to show the idea of the algorithm
The KMP algorithm (and its improved version) essentially deals with one question: where should the substring pointer j move when the match fails at position j? So the meaning of the next (nextval) array is: if the match fails at this position, where should the pointer move to
While characters keep matching, the KMP algorithm and the BP algorithm behave identically
My understanding may contain mistakes; corrections are welcome

Last modification: January 30, 2023



You can see that the matching process code is basically the same as the BP algorithm, but the processing is different where the matching error occurs</trans> <trans data-src="for (int i = 0,j=0;i&lt;s.length &amp;&amp; j &lt; j.length;){">for (int i = 0,j=0;i&lt;s.length &amp;&amp; j &lt; j.length;){</trans> <trans data-src="if (s[i] == t[j]){//目前匹配成功">If (s [i]==t [j]) {//The current match is successful</trans> <trans data-src="j++;">j++;</trans> <trans data-src="i++;">i++;</trans> <trans data-src="}else{//该位置匹配失败，i不动">}Else {//The location matching failed, I will not move</trans> <trans data-src="if(next[j] != -">if(next[j] != -</trans> <trans data-src="1){//j移动到next[j]的位置继续与s[i]比较">1) {//j Move to the next [j] position to continue comparing with s [i]</trans> <trans data-src="j = next[j]; ">j = next[j]; </trans> <trans data-src="}else{//如果next[j] == -1，表示该匹配失败处是子串的0位置，所以i需要前移一位，j仍然保持不变为0">}Else {//If next [j]==- 1, it means that the position where the matching failed is 0 of the substring, so I needs to move forward one bit, and j remains unchanged at 0</trans> <trans data-src="j = 0;//这句话可以去掉，因为此时j一定为0">J=0;//This sentence can be removed because j must be 0 at this time</trans> <trans data-src="i ++;">i ++;</trans> <trans data-src="}">}</trans> <trans data-src="}">}</trans> <trans data-src="}">}</trans> <trans data-src="//结束循环后，判断是否匹配成功">//After finishing the cycle, judge whether the matching is successful</trans> <trans data-src="if (j &gt;= ">if (j &gt;= </trans> <trans data-src="t.length){//表示j已经访问了整个t子串了，即匹配成功">t. 
Length) {//indicates that j has accessed the entire t substring, which means the matching is successful</trans> <trans data-src="return i - j - 1; // 返回一开始成功匹配的位置">Return i - j - 1;//Returns the location of the first successful match</trans> <trans data-src="}else{">}else{</trans> <trans data-src="return -1;">return -1;</trans> <trans data-src="}</code></pre><hr>KMP算法的问题在于忽视了一种特殊的模式子串，如<code>aaaaa</code>，这种子串还能够利用本身的特殊值相等性质进一步简化检测匹配的过程。</">}The problem of</code></pre><hr>KMP algorithm is that it ignores a special pattern substring, such as<code>aaaaa</code>. This substring can also further simplify the detection and matching process by using its ownspecial value equalityproperty</</trans> <trans data-src="p>我们重新回到KMP算法中，当子串在j位置上匹配失败，（break = j 我们用break 临时变量保存匹配失败的位置）j = next[j] 然后判断s[i] 是否与 t[j] 相等。</">p> We return to the KMP algorithm. When the substring fails to match at position j, (break=j, we use the break temporary variable to save the location of the matching failure) j=next [j], and then judge whether s [i] is equal to t [j]</</trans> <trans data-src="p>我们之前利用的是相似性 + 小子串与主串对应相等，这里我们还可以挖掘特殊值相等 + j(移动前的j即break) 位置与主串的对应位置i不相等。</">p> We usedsimilarity+the corresponding equality between the sub string and the main string. 
Here we can also mine thespecial value equality+j (the j before the move is the break) position is not equal to the corresponding position i of the main string</</trans> <trans data-src="p>因为t[break] != ">p> Because t [break]=</trans> <trans data-src="s[i]，如果 t[j] = t[break] (这里的j是移动后的j了)，那么t[j] 也一定不等于s[i]，则我们可以跳过s[i] 与t[j] 比较这一步，而直接j再次移动到next[j] (j = next[j])，然后再s[i] 与 t[j] 比较。</">S [i],If t [j]=t [break] (where j is the moved j), then t [j] must not be equal to s [i], then we can skip the step of comparing s [i] with t [j], and directly move j to next [j] (j=next [j]), and then compare s [i] with t [j]</</trans> <trans data-src="p>总结：当子串在j位置匹配失败，且t[next[j]] = t[j] ，那么可以直接将s[i] 与t[next[next[j]]] 进行比较。</">p> Summary: When a substring fails to match at position j, and t [next [j]]=t [j], then s [i] can be directly compared with t [next [j]]</</trans> <trans data-src="p><img src="">p><img src="</trans> <trans data-src="https://www.ihewro.com/usr/uploads/sina/5cc07f0e4f602.jpg">https://www.ihewro.com/usr/uploads/sina/5cc07f0e4f602.jpg</trans> <trans data-src="" alt="" title="" style="">如上图所示，即可以直接将s[i] 与t[next[j-i]] 进行比较。上面讲的是KMP算法为什么需要改进，具体改进方法见下面。<h2>KMP 改进算法</h2>根据上面的分析，其实我们只需要在计算next数组的时候考虑到t[j] 与 t[next[j]">"Alt=" "title=" "style=" ">As shown in the figure above, we can directly compare s [i] with t [next [j-i]].The reason why the KMP algorithm needs to be improved is described above, and the specific improvement methods are shown below.<h2>Improved KMP algorithm</h2>According to the above analysis, we only need toconsider t [j] and t [next [j] when calculating the next array</trans> <trans data-src="] 是否相等：<ul><li>如果相等，next[j] =next[next[j]];</">]Equal:<ul><li>If equal, next [j]=next [j]]</</trans> <trans data-src="li><li>如果不相等话，next[j] 值还是以前的计算方法（小子串相似区域的长度）</li></ul>为了以便和上个算法区分，我们将这个next数组改名为nextval数组。</">Li><li>If they are not equal, the next [j] value is still the previous calculation method (the length of the similar region of the substring)</li></ul>In 
order to distinguish from the previous algorithm, we rename the next array as the nextval array</</trans> <trans data-src="p>事实上，nextval数组计算过程中是包含next数组的计算的，所以我们手算的时候都是先算next数组，再算nextval数组的，过程如下：比较t[j]与t[next[j]]是否相等：<ul><li>如果相等，next[j] =nextval[next[j]];</">p> In fact, thenextval array calculation process includes the calculation of the next array, so when we manually calculate, we first calculate the next array, and then calculate the nextval array. The process is as follows:Compare whether t [j] is equal to t [next [j]]:<ul><li>If equal, next [j]=nextval [next [j]]</</trans> <trans data-src="li></ul>注：(next[j] 值肯定是小于j，所以nextval[next[j]]值之前已经计算出来了，这是因为小子串的相似区域的长度肯定小于小子串的长度（因为必须要偏移一定距离）)<ul><li>如果不相等话，">Li></ul>Note: (The next [j] value must be less than j, so the nextval [next [j]] value has been calculated before, because the length of the similar area of the small string must be less than the length of the small string (because it must be offset by a certain distance))<ul><li>If not equal,</trans> <trans data-src="nextval[j] = next[j]</li></ul>这种过程和上面的过程是有一定区别的：第一种方法是只有nextval一个数组，虽然计算过程中会计算小子串的相似区域长度，适用于算法代码第二种方法是明确先求next数组，再求nextval数组，方便手算的。</">Nextval [j]=next [j]</li></ul>This process is different from the above process:The first method is that only nextval is an array. Although the length of the similar area of the substring will be calculated during the calculation process, it is applicable to the algorithm code</</trans> <trans data-src="p>KMP改进算法的代码：<pre><code class="lang-c">int KMPindex(String s,String t){">p> KMP improved algorithm code:<pre><code class="lang-c">int KMPindex (String s, String t){</trans> <trans data-src="//1. ">//1. 
</trans> <trans data-src="求出子串的nextval数组">Find the nextval array of substrings</trans> <trans data-src="int [] nextval = new int[MaxSize];">int [] nextval = new int[MaxSize];</trans> <trans data-src="nextval[0] = -1;//">nextval[0] = -1;//</trans> <trans data-src="固定的前一位的值">Fixed value of the previous digit</trans> <trans data-src="int k = -1;">int k = -1;</trans> <trans data-src="for(int j= 0;">for(int j= 0;</trans> <trans data-src="j&lt;t.length;){//对每一位找到小子串的相似区域的长度">J&lt; t.length;) {//Find the length of the similar region of the small string for each bit</trans> <trans data-src="//同KMP,只要求手算结果即可，代码没有深入研究">//Same as KMP, it only requires manual calculation results, and the code has not been studied in depth</trans> <trans data-src="}">}</trans> <trans data-src="//2. ">//2. </trans> <trans data-src="开始匹配,可以看到匹配过程代码和BP算法基本一样，只是在匹配出错的地方处理不一样">Start matching. You can see that the matching process code is basically the same as the BP algorithm, but the processing is different where the matching error occurs</trans> <trans data-src="for (int i = 0,j=0;i&lt;s.length &amp;&amp; j &lt; j.length;){">for (int i = 0,j=0;i&lt;s.length &amp;&amp; j &lt; j.length;){</trans> <trans data-src="if (s[i] == t[j]){//目前匹配成功">If (s [i]==t [j]) {//The current match is successful</trans> <trans data-src="j++;">j++;</trans> <trans data-src="i++;">i++;</trans> <trans data-src="}else{//该位置匹配失败，i不动">}Else {//The location matching failed, I will not move</trans> <trans data-src="if(next[j] != -">if(next[j] != -</trans> <trans data-src="1){//j移动到nextval[j]的位置继续与s[i]比较">1) {//j Move to the position of nextval [j] to continue comparing with s [i]</trans> <trans data-src="j = nextval[j]; ">j = nextval[j]; </trans> <trans data-src="}else{//如果nextval[j] == -1，表示j最终会移动到0位置上，且 t[j] == t[0]，所以">}Else {//If nextval [j]==- 1, it means that j will eventually move to position 0, and t [j]==t [0], so</trans> <trans data-src="j = 0;//这句话一定不能去掉！！">J=0;//This sentence must not be removed!!</trans> <trans 
data-src="因为不仅只有0位置nextval[0] = -1还有别的位置也可能等于-1">Because not only 0 position nextval [0]=- 1, but also other positions may be equal to - 1</trans> <trans data-src="i ++;">i ++;</trans> <trans data-src="}">}</trans> <trans data-src="}">}</trans> <trans data-src="}">}</trans> <trans data-src="//结束循环后，判断是否匹配成功">//After finishing the cycle, judge whether the matching is successful</trans> <trans data-src="if (j &gt;= ">if (j &gt;= </trans> <trans data-src="t.length){//表示j已经访问了整个t子串了，即匹配成功">t. Length) {//indicates that j has accessed the entire t substring, which means the matching is successful</trans> <trans data-src="return i - j - 1; // 返回一开始成功匹配的位置">Return i - j - 1;//Returns the location of the first successful match</trans> <trans data-src="}else{">}else{</trans> <trans data-src="return -1;">return -1;</trans> <trans data-src="}">}</trans> <trans data-src="}</code></pre><h2>最后</h2>关于以上个人体会，有以下几点说明：<ol><li>代码是在编辑器手敲的，没有经过代码运行，很大可能是运行有错误，只是用来展示算法思路</li><li>可以看到KMP（以及改进后）算法本质上都是在处理子串j位置匹配失败，">}</code></pr e><h2>Finally</h2>About the above personal experience, there are the following points:,</trans> <trans data-src="子串的j需要如何移动。">How to move the j of the substring.</trans> <trans data-src="所以next(nextval)数组的意义就是如果该位置匹配失败了，当前位置应该移动到什么位置上</li><li>KMP算法和BP算法在匹配成功的过程中是完全一致的</li><li>理解可能有错误之处，请不吝指正</li></ol>">So the meaning of the next (nextval) array is that if the position matching fails, where the current position should be moved</li><li>KMP algorithm and BP algorithm are identical in the process of successful matching</trans>
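The hand procedure described in the improved-KMP section (compute next first, then derive nextval from it) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the textbook's code: the function names build_next, build_nextval and kmp_index are made up here, and the matcher uses the easier-to-read mismatch handling from the second version of step 2.

```python
def build_next(t):
    """next[j] = length of the similar region of t[:j]; next[0] is fixed at -1."""
    nxt = [-1] * len(t)
    j, k = 0, -1
    while j < len(t) - 1:
        if k == -1 or t[j] == t[k]:
            j += 1
            k += 1
            nxt[j] = k
        else:
            k = nxt[k]  # fall back to a shorter similar region
    return nxt


def build_nextval(t):
    """Hand method: compute next first, then compare t[j] with t[next[j]]."""
    nxt = build_next(t)
    nextval = [-1] * len(t)  # nextval[0] is fixed at -1
    for j in range(1, len(t)):
        if t[j] == t[nxt[j]]:
            # Equal characters: comparing s[i] with t[next[j]] would fail too.
            nextval[j] = nextval[nxt[j]]
        else:
            nextval[j] = nxt[j]
    return nextval


def kmp_index(s, t, table):
    """Return the start of the first occurrence of t in s, or -1.

    `table` may be either the next array or the nextval array of t.
    """
    if not t:
        return 0  # assumption: an empty pattern matches at position 0
    i = j = 0
    while i < len(s) and j < len(t):
        if s[i] == t[j]:
            i += 1
            j += 1
        elif table[j] != -1:
            j = table[j]  # i stays put, only j moves
        else:
            j = 0  # required for nextval, where entries other than 0 can be -1
            i += 1
    return i - j if j == len(t) else -1


if __name__ == "__main__":
    pattern = "aaaab"
    print(build_next(pattern))
    print(build_nextval(pattern))
    print(kmp_index("aaabaaaab", pattern, build_nextval(pattern)))
```

For a pattern such as aaaab, a mismatch at position 3 makes the next-based matcher retry positions 2, 1 and 0 one after another, while the nextval entry for position 3 is -1 and the matcher advances i immediately. Both tables give the same search result.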
