
A shell script for whole-site caching and pre-caching, further improving the site's overall loading speed

Jager · April 16, 2016 · Read 1,731 times

On Linux, a shell script combined with the system scheduler crontab can easily accomplish work that would otherwise require complex programs, at low development cost and with a gentle learning curve.

Zhang Ge's blog has shared many practical shell applications for site operations before, such as:

CCKiller: a lightweight CC-attack defense tool for Linux, with second-level checks and automatic blocking and unblocking

SEO tip: a shell script that automatically submits a site's 404 dead links to search engines

A Linux/VPS 7-day local rotating backup and 7+N remote backup script

An Nginx log-rotation script that also deletes logs older than 7 days

A Shell+Curl website health-check script, to catch the lost sites of the China Blog Alliance

If you are interested, pick whichever catches your eye.

This article continues the series with another practical shell case: whole-site caching and scheduled pre-caching, to further improve website speed.

Ps: A more efficient Python pre-caching tool has since been released and is recommended: https://zhang.ge/5154.html


1. What is pre-caching

Any webmaster who has used the WP Super Cache plugin will know its pre-caching (preload) function: once enabled, the plugin caches the entire site in advance and refreshes the cache on a schedule afterwards.

Obviously, the advantage of whole-site pre-caching is that the static cache is generated before users arrive, rather than being triggered by their visits. Almost every visitor then hits the static cache, and the average or overall speed improves qualitatively! Most importantly, it speeds things up for search-engine spiders!

If you check the crawl frequency on the Baidu webmaster platform, you can see the spiders' average fetch times. My blog uses a static cache, so each fetch should reasonably stay under 500 ms, yet some requests still take ten or twenty seconds:

[Figure: Baidu webmaster platform crawl-time chart]

Setting aside network latency and concurrent load during crawls, another likely cause is that the spider happened to fetch a page whose cache had expired or did not yet exist. In other words, the cache had just expired and been deleted when the spider arrived, so it got a dynamically generated page, and the fetch time shot up!

Therefore, it is necessary to pre-cache the whole site.

2. The inspiration for pre-caching

Having seen why pre-caching matters, it is time to implement it. But before sharing the method, let me mention where the inspiration came from!

I remember sharing various WordPress caching schemes on this blog before, including a pure-PHP code version and Nginx's fastcgi_cache. Back then someone asked: is there any way to statically cache the sitemap too (for the pure-code sitemap)?

At that time, sitemap.php was pseudo-statically rewritten to sitemap.xml, so it was dynamic data; and since it sat in the web root, sitemap.php could also be requested directly. Because it pulled data for the whole site, the file ran slowly!

Later I solved this with a Linux command plus crontab: move sitemap.php into an obscure directory, then periodically request that file with wget and save the output as sitemap.xml in the website root! For example:

```shell
# Generate sitemap.xml at 01:00 every day; diypath is the actual location of sitemap.php under the web root
0 1 * * * wget -O /home/wwwroot/zhang.ge/sitemap.xml https://zhang.ge/diypath/sitemap.php >/dev/null 2>&1
```

Ps: With this method, change require('./wp-blog-header.php'); in sitemap.php to require('../wp-blog-header.php'); in other words, mind the relative path!
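If you prefer to script that one-off edit, a sed one-liner does it. The sketch below works on a scratch copy in /tmp rather than the real sitemap.php, so the path and file contents are placeholders:

```shell
# Scratch stand-in for sitemap.php (placeholder path and content)
f=/tmp/sitemap_demo.php
printf "%s\n" "<?php require('./wp-blog-header.php'); ?>" > "$f"
# Swap the relative require path after moving the file one directory deeper
sed -i "s#require('./wp-blog-header.php')#require('../wp-blog-header.php')#" "$f"
grep "require" "$f"    # shows the updated require line
```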

This solved the problem of sitemap.xml being dynamic data!

3. Whole-site pre-caching

With the case above, implementing whole-site pre-caching is really simple.

It can be implemented in the following ways:

① Blogs with a caching function

For blogs that already cache, say via a caching plugin or Nginx's fastcgi_cache, you only need to pull all article IDs or slugs from the database, assemble them into page URLs, and then request every one of them with wget or curl to warm the cache, for example:

```shell
#!/bin/bash
# My blog uses post slugs, hence "select post_name"; if your permalinks use IDs, select ID instead
for post in $(mysql -uroot -p'database_password' -e "use database_name; select post_name from wp_posts where post_type='post' and post_status='publish';" | sed -n '2,$p')
do
    # Request the page with wget and throw the data into the "black hole", i.e. do not save it,
    # just trigger the blog's caching function
    wget -O /dev/null "https://zhang.ge/${post}.html"
    sleep 0.5  # pause 0.5 s to avoid high load
done
```

However, permalink structures differ from blog to blog, so this splicing of IDs or slugs cannot simply be copied, and it misses categories, tags, and so on, which is a pity.

I was too lazy to work out how to pull every page from the database, so in the end I settled on a shortcut: get the page addresses from sitemap.xml!

Almost every website has a sitemap.xml file. If yours does not, refer to the earlier article and generate one first!
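To see exactly what the awk extraction used below produces, here it is run against a tiny hand-made sitemap (the example.com URLs are placeholders, not real pages):

```shell
# Build a minimal sample sitemap (placeholder URLs)
cat > /tmp/sitemap_sample.xml <<'EOF'
<urlset>
<url><loc>https://example.com/</loc></url>
<url><loc>https://example.com/cat/1.html</loc></url>
</urlset>
EOF
# Split each line on <loc>/</loc>, keep the middle field, drop empty lines
awk -F"<loc>|</loc>" '{print $2}' /tmp/sitemap_sample.xml | sed '/^$/d'
```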

So the script becomes:

```shell
#!/bin/bash
# Enter the website root directory; adjust to your actual path
cd /home/wwwroot/zhang.ge/
# Pull every page address out of sitemap.xml and request each one, 0.5 s apart, to trigger the cache
for url in $(awk -F"<loc>|</loc>" '{print $2}' sitemap.xml)
do
    wget -O /dev/null $url
    sleep 0.5
done
```

After editing for your setup, save the code as g_cache.sh and upload it to the server, for example into /root. Run it manually first to check that it works:

bash /root/g_cache.sh

[Figure: g_cache.sh run output]

As the figure shows, if there are no errors (ignore the startling speed in the screenshot; it is related to disk IO), finally add a scheduled task:

```shell
# Pre-cache the whole site at 03:00 every morning
0 3 * * * bash /root/g_cache.sh >/dev/null 2>&1
```

Duang, and you're done!

② Blogs without a cache

A blog without caching means you don't fancy caching, and enabling it may indeed be unnecessary; the following is included only to keep the article complete. Skim it selectively!

For a blog without caching, there are two ways to pre-cache the whole site:

Install a caching plugin or enable some other cache, then use method ①

Keep the cache off but still get whole-site pre-caching. What then?

The first way needs no further words, so let me share how to implement the second.

From method ① we saw that the pages are merely requested and the data discarded into the black hole. What if, instead, we saved each response as a corresponding HTML file inside the website's directory tree? Then we would get the same static caching as the cos-html-cache plugin!

Obviously, it works! The code is as follows:

```shell
#!/bin/bash
# Website root directory; adjust to your actual path
base_dir=/data/wwwroot/zhang.ge
# No-cache list: keywords of page addresses that should not be cached, separated by |
white_list="go.html|goto.html|liuyan.html"
# Cache folder name
cache_store=$base_dir/html_cache
# Get all page addresses from sitemap.xml
for url in $(awk -F"<loc>|</loc>" '{print $2}' $base_dir/sitemap.xml | sed '/^$/d' | egrep -v $white_list)
do
    # Strip scheme and host to get the request path, e.g. cat/1.html
    cache_dir=${url#http*://*/}
    # If the scheme is still present, nothing was stripped: this is the home page
    if echo $cache_dir | egrep "http.*://" >/dev/null; then
        cache_dir=$cache_store
        # Without arguments the script skips existing caches;
        # pass any argument to rebuild every cache
        if [[ -z $1 ]]; then
            test -f $cache_store/index.html && continue
        fi
    else
        cache_dir=$cache_store/$cache_dir
        if [[ -z $1 ]]; then
            test -f $cache_dir/index.html && continue
        fi
    fi
    # Create the cache directory
    mkdir -p $cache_dir
    # Save the page as index.html in the corresponding cache directory, much like WP Super Cache
    curl -o $cache_dir/index.html $url
    sleep 0.5
done
```
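The `${url#http*://*/}` expansion in the script is what turns a full URL into a cache path; here is a quick self-contained illustration (example.com is a placeholder):

```shell
url="https://example.com/cat/1.html"
# Strip the shortest prefix matching http*://*/ (scheme + host + first slash)
echo "${url#http*://*/}"        # prints: cat/1.html
# With no trailing slash the pattern finds no "/" to anchor on, so the URL survives intact;
# that leftover "http...://" is exactly what the script's egrep test catches as the home page
home="https://example.com"
echo "${home#http*://*/}"       # prints: https://example.com
```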

Adjust the website root directory and cache whitelist in the code for your setup, save it as g_cache.sh, and upload it to the server. Next we need to add an Nginx pseudo-static rule, much like the one used for WP Super Cache before:

```nginx
location / {
    try_files $uri $uri/ /index.php?$args;  # default WordPress pseudo-static rule; the cache rules are added below
    if (-f $request_filename) {
        break;
    }
    set $cache_file '';
    set $cache_uri $request_uri;
    if ($cache_uri ~ ^(.+)$) {
        # Note: the path on the next line must correspond to the cache_store path defined in the cache script
        set $cache_file /html_cache/$1/index.html;
    }
    # Rewrite only when the cache file exists
    if (-f $document_root$cache_file) {
        #rewrite ^(.*)$ $cache_file break;
        rewrite ^ $cache_file last;
    }
    # Hand all other requests to WordPress
    if (!-e $request_filename) {
        rewrite . /index.php last;
    }
}
```

After saving, reload Nginx for the rules to take effect.

Finally, create scheduled tasks like the following to run g_cache.sh periodically:

```shell
# Rebuild the whole-site pre-cache at 03:00 every Monday (per the script comments, any argument forces a full rebuild)
0 3 * * 1 bash /root/g_cache.sh all >/dev/null 2>&1
# Check hourly and cache any articles that are missing a cache (covers newly published posts)
0 */1 * * * bash /root/g_cache.sh >/dev/null 2>&1
```

This reproduces both WP Super Cache's pre-caching and cos-html-cache's static caching.

4. Final remarks

In fact, I think the biggest highlight of this article is the final script, which implements both caching and pre-caching: whatever cache plugin or pseudo-static trick you were using can be thrown away! Moreover, as long as a website has a sitemap.xml file, it can get static caching this way, regardless of which platform it is built on!

Cool as it is, a few details still deserve attention. Please read on.

① Hosts resolution

Since the crawl happens locally on the server itself, to improve speed and shorten the path it is strongly recommended to resolve the website's domain to the server IP in hosts rather than via external DNS, which saves resolution time and avoids CDN traffic.

It is very simple: edit the /etc/hosts file and insert an entry, such as:

127.0.0.1  zhang.ge

Finally, save it.

② Generation interval

The scheduled tasks shared above run at fixed times. If you want a shorter interval, adjust the crontab entries yourself; look up the crontab syntax to learn how the minute, hour, day, month, and weekday fields are defined, which I won't repeat here.
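For example, to regenerate every 6 hours instead of once a day, only the hour field changes (a sketch reusing the script path from above; adjust to your setup):

```shell
# minute hour day-of-month month day-of-week  command
0 */6 * * * bash /root/g_cache.sh >/dev/null 2>&1
```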

③ Cache deletion

This article only covers generating the cache, not deleting it automatically. On the whole, crontab regenerates the cache regularly anyway, so in principle there is no need to worry about refreshing it.

Still, some perfectionists will scratch their heads when comments do not refresh, or when edited articles do not update. So here is a way out...

For websites that already have a caching function, this pre-caching script changes nothing: if the cache refreshed automatically before, it still does, and no action is required.

For websites using the last script, it works the same as the previously shared PHP-generated HTML cache. If you want the cache deleted when an article is updated or a comment is submitted, refer to the earlier post and adjust the cache path:

WP Super Cache static caching, code-only version (compatible with multi-domain websites)
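As a rough sketch of what such a manual purge boils down to: delete the page's cached index.html and let the next scheduled run regenerate it. The paths below are hypothetical stand-ins, not the real cache directory:

```shell
# Stand-in for the real cache root, e.g. /data/wwwroot/zhang.ge/html_cache
cache_store=/tmp/html_cache_demo
page_path="cat/1.html"                          # hypothetical page to purge
# Simulate an existing cache entry
mkdir -p "$cache_store/$page_path"
touch "$cache_store/$page_path/index.html"
# Delete it; the next scheduled g_cache.sh run will regenerate it
rm -f "$cache_store/$page_path/index.html"
ls "$cache_store/$page_path"                    # prints nothing: the cached copy is gone
```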

Well, that's it for now; leave a comment if you need to...


Latest update: relying on sitemap.xml feels a bit low-tech, so I have also provided a scheme that works without sitemap.xml! To avoid confusion with the content above, it lives on a new page. Read it if you need it; ignore it otherwise.

58 responses
  1. Black Apple Paradise 2016-4-16 · 15:24

    Amazing Jager!

    Very easy to use, and the speed is great!

  2. Small C blog 2016-4-16 · 18:04

    Your site speed is really 666, and I've also been thinking about speeding up my server lately!!

  3. Reten 2016-4-16 · 18:49

    No wonder Jager's site speed can't be matched. So it was whole-site caching all along.

  4. The road of struggle 2016-4-16 · 20:19

    Your website feels like an ice bucket

  5. Yan Yuxi. 2016-4-17 · 1:50

    Brother Zhang is still smart, hahaha

  6. My heart is tied to her 2016-4-17 · 17:38

    Brother Zhang, a question: will it eat up memory?

    •  avatar
      Jager 2016-4-17 · 17:53

      Running the script may add a certain amount of load, but the impact is small. You can slow the request rate via usleep.

  7. Microblog 2016-4-18 · 22:03

    I still use the virtual host all the time. I use Baidu Cloud Cache, but it doesn't feel smooth enough

  8. Dream returning blog 2016-4-19 · 20:04

    When I open your site, a pop-up box always prompts something about retrying after a timeout, on both my computer and my phone

    •  avatar
      Jager 2016-4-21 · 8:13

      It is probably Baidu Share

  9. Yangyangyang Blog 2016-4-20 · 11:30

    It's really good to be at the top. I've learned something

  10. Qunxian calligraphy 2016-4-20 · 18:51

    It's very powerful

  11. Long Xiaotian 2016-4-20 · 20:40

    Completely incomprehensible 0.0 :shock:

  12.  avatar
    Jager 2016-4-21 · 8:14

    The script uses the physical disk path of sitemap.xml, not its URL.

    • CC 2016-4-21 · 9:32

      What do you mean? Do I need to create a scheduled task first and let it generate an XML file at that path?

    • CC 2016-4-21 · 10:07

      I followed your method on the second page but couldn't get it to work. Frustrating!
      If convenient, add my QQ: 75656768
      Also, there is a reply here that I can't see. Is it being cached?

    • CC 2016-4-21 · 22:54

      The code is updated. Thank you today! It took half a day..

  13. Ding Chunhua 2016-4-21 · 22:33

    I found another technical talent, I admire him.

  14. Pure blog 2016-4-22 · 9:06

    Yes, I can try it

  15. Animation 2016-4-22 · 16:38

    Every time I receive your email notification, but when I visit your site, my comment and your reply still haven't appeared. Why isn't the cache deleted and refreshed promptly?

    • Animation 2016-4-22 · 16:40

      Only on the message board; article pages are fine.
      Do you use fastcgi_cache in proxy mode or local mode?

    •  avatar
      Jager 2016-4-22 · 17:43

      Theoretically, it is cleared immediately. There is also a manual refresh button in the lower right corner

  16. Dress 2016-4-22 · 22:38

    I'm not a technical person! But I worship the god of technology!!! Come and support!

  17. CC 2016-4-29 · 1:05

    Following your method on the second page, I tried it myself, but checking the cache in Google's browser shows no cache, even though HTML pages were generated in the root directory. What is the problem?

    •  avatar
      Jager 2016-4-29 · 12:50

      One variable is the website root directory, which must be set to your site's actual directory; otherwise the files are generated alongside the script

      • CC 2016-4-30 · 18:23

        I changed the address.
        For the second method, do I need to modify the Nginx rules and add a piece of code in index.php so the HTML pages are served normally?
        I have read your posts many times, following the steps one by one, and I'm ready to rewrite all of them so that complete beginners can understand at a glance.
        I have many questions to trouble you with. If you have time, please reply. Thank you!

  18. Baiyi Bar 2016-5-1 · 13:51

    After caching, clicking the next page gives a 404? What could cause this?

    •  avatar
      Jager 2016-5-1 · 15:50

      This problem does exist. I will see how to be compatible in the evening.

    •  avatar
      Jager 2016-5-1 · 22:46

      If the nginx rule is added, it should not be 404.

      • Baiyi Bar 2016-5-2 · 14:52

        Can you tell me which nginx rule it is? I tried it, using the first method! I use WP Super Cache for caching. Is this unnecessary then?

        •  avatar
          Jager 2016-5-2 · 20:50

          There are plug-ins that can ignore this article.

          • Baiyi Bar 2016-5-23 · 18:43

            I removed the code-cache version, but the speed still wasn't up, so I'm using this article's whole-site cache. But I can't tell whether it is being hit. In theory, with a CDN it shouldn't be this slow!

      • Baiyi Bar 2016-5-2 · 15:29

        I finally found the reason: I have this line in nginx:

         rewrite ^/([a-z-A-Z]+)/(\d+)$ /$1/ permanent;
        • CC 2016-5-2 · 22:45

          Why isn't the HTML being served after I cached it? Depressed.

          • Baiyi Bar 2016-5-8 · 18:07

            How do you check whether there are calls?

  19. Seven swords 2016-5-4 · 2:51

    Another good way

  20. Dream back 2016-5-4 · 14:55

    Tested. Usleep 5000 is not 0.5 seconds, while usleep 500000 is 0.5 seconds

    •  avatar
      Jager 2016-5-4 · 15:48

      Indeed, mathematics is too bad.

  21. wechat Business 2016-5-12 · 18:06

    Very fast

  22. Brother Zhang, my website's sitemap.xml is the pseudo-static kind from https://zhang.ge/4554.html ,
    so I changed the code to

    ```shell
    #/bin/bash
    # Enter the website root directory; adjust as needed
    cd /www/web/mamijie.cc_s5k3PL2/public_html/
    # Download sitemap.xml locally first
    wget -c http://www.mamijie.cc/sitemap.xml
    # Get all page addresses from sitemap.xml and request them, 0.5 s apart, to trigger the cache
    for url in $(awk -F"<loc>|</loc>" '{print $2}' sitemap.xml)
    do
        wget -O /dev/null $url
        usleep 500000
    done
    ```

    That is, I added a step to download sitemap.xml locally, and it runs normally. Are there any downsides? Or is there another way?

    • Another problem: I use VeryCloud's CDN and see a low hit rate in its dashboard. Before http://www.xxxx.com/1.html is cached by VeryCloud, the response header shows MISS; visiting 1.html again hits the CDN and the header shows HIT. (Correct me if my understanding is wrong.) Now I want to use this script to warm the whole site, but after running it, the first visit to 2.html still shows MISS, and only the second visit shows HIT

      •  avatar
        Jager 2016-5-24 · 0:09

        Did you set up local resolution? That bypasses the CDN

        • At first I didn't set up local resolution, and after a day or two the hit rate was still the same
          Only later did I add local resolution to speed things up

  23. Ant Blog 2016-5-23 · 11:32

    Nice, but the script is too technical for a newbie like me~

  24. CC 2016-5-28 · 13:19

    After this pseudo-static setup generates the HTML pages,
    do I need to add anything to wp-config.php? Something like this?
    define('WP_CACHE', true); // Added by WP-Cache Manager
    define('CACHE_ROOT', dirname(__FILE__).'/data/wwwroot/ccmsn.com/html_cache/'.$_SERVER['HTTP_HOST']);

    After I add it, the home page is blank, and running the script shows 500 with no HTML generated.
    This is my conf
    #Please note that the path of the following line of code corresponds to the path defined by CACHE_ROOT in the cache code:
    set $cache_file /data/wwwroot/ccmsn.com/html_cache/$1/index.html;

  25. I use a CDN.
    Using Brother Zhang's code, I found that some CDN-resolved nodes would time out and then hang for a long time. Can I change to the code below, retrying at most twice and giving up after 3 seconds?

     ```shell
     #/bin/bash
     # Enter the website root directory; adjust as needed
     cd /www/web/mamijie.cc_s5k3PL2/public_html/
     wget -c http://www.mamijie.cc/sitemap.xml
     # Request every page, retrying at most 2 times with a 3-second timeout
     for url in $(awk -F"<loc>|</loc>" '{print $2}' sitemap.xml)
     do
         wget -t 2 -T 3 -O /dev/null $url
         usleep 500000
     done
     ```
    •  avatar
      Jager 2016-8-8 · 8:21

      With a CDN, you can resolve the site to 127.0.0.1 locally in /etc/hosts to bypass the CDN

  26. MCC5 University 2016-6-12 · 11:30

    Blog is good, often come to visit.

  27. Tang Qing 2016-7-4 · 22:07

    After enabling the cache, submitting a comment on a post prompts a 405 error

  28. seonoco 2016-7-18 · 10:49

    Get a simple and easy to use new method!

  29. Zen Cat 2017-1-13 · 20:06

    I used to like asking for trouble as much as you do :sad:
    I even load-balanced WordPress across several VPSes.. I forgot what the original point of the blog was :arrow:

    •  avatar
      Jager 2017-1-13 · 20:38

      Learn and progress in the struggle

  30. Xiaoyu Studio 2018-4-29 · 17:20

    Yeah, I use the pre-cache generated by WP Rocket. It seems more convenient than WP Super Cache, but it costs money!

  31. sheruo 2019-1-2 · 12:36

    Can this method be used with Redis?

  32. Goy 2019-1-5 · 14:41

    Hello blogger, I followed your steps, but when running bash /root/g_cache.sh it only caches the home page and nothing else. My sitemap is generated by a plugin; because of my theme I can't use your code version of the sitemap.

    •  avatar
      Jager 2019-1-20 · 13:50

      You may need to adapt the script to your sitemap