Server anti-crawler strategy: blocking unwanted user agents in Apache/Nginx/PHP

Watson Blog · September 18, 2017

As we all know, the web is full of crawlers. Some of them help a site get indexed, such as Baidu's spider, while many useless crawlers neither follow the robots rules nor bring any traffic to the site. Recently I noticed a lot of junk-crawler entries in my Nginx logs, so I collected and organized the various methods found online for blocking these spam spiders. I applied them while configuring my own server, and they may serve as a reference for other webmasters.


I Apache

①. By modifying the .htaccess file

Edit the .htaccess file in the website root directory and add one of the following two snippets (either one works):

Option 1:

  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} "(^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" [NC]
  RewriteRule ^(.*)$ - [F]

Option 2:

  SetEnvIfNoCase ^User-Agent$ ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
  Order Allow,Deny
  Allow from all
  Deny from env=BADBOT

②. By modifying the httpd.conf configuration file

Find the corresponding section in httpd.conf, add or modify it as shown below, and then restart Apache:

  DocumentRoot /home/wwwroot/xxx
  <Directory "/home/wwwroot/xxx">
      SetEnvIfNoCase User-Agent ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
      Order allow,deny
      Allow from all
      Deny from env=BADBOT
  </Directory>
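Note that Order/Allow/Deny belongs to the old Apache 2.2 access-control model. If your server runs Apache 2.4 or later, the same idea is normally expressed with Require directives instead. The block below is only a minimal sketch of that variant, with the UA list abbreviated for readability and the same directory path and BADBOT variable assumed as above:

  <Directory "/home/wwwroot/xxx">
      SetEnvIfNoCase User-Agent ".*(FeedDemon|AhrefsBot|ApacheBench|MJ12bot|ZmEu|Python-urllib|EasouSpider)" BADBOT
      # Apache 2.4 style: allow everyone except requests whose UA matched the pattern above
      <RequireAll>
          Require all granted
          Require not env BADBOT
      </RequireAll>
  </Directory>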

II Nginx code

Go to the conf directory under the Nginx installation directory and save the following code as agent_deny.conf:

cd /usr/local/nginx/conf

vim agent_deny.conf

  # Block scraping tools such as Scrapy
  if ($http_user_agent ~* "(Scrapy|Curl|HttpClient)") {
      return 403;
  }
  # Block the listed UAs as well as requests with an empty UA
  if ($http_user_agent ~ "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
      return 403;
  }
  # Block request methods other than GET, HEAD, and POST
  if ($request_method !~ "^(GET|HEAD|POST)$") {
      return 403;
  }

Then insert the following line after location / { in your site's configuration:

  include agent_deny.conf;

For example, the configuration of Zhang Ge's blog:

  [marsge@Mars_Server ~]$ cat /usr/local/nginx/conf/zhangge.conf
  location / {
          try_files $uri $uri/ /index.php?$args;
          # Add this single line here:
          include agent_deny.conf;
          rewrite ^/sitemap_360_sp.txt$ /sitemap_360_sp.php last;
          rewrite ^/sitemap_baidu_sp.xml$ /sitemap_baidu_sp.php last;
          rewrite ^/sitemap_m.xml$ /sitemap_m.php last;

After saving, run the following command to reload Nginx gracefully:

  /usr/local/nginx/sbin/nginx -s reload
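If you want to verify that the edited configuration (including the new agent_deny.conf) parses cleanly, Nginx also provides a built-in syntax check that can be run before the reload; the path below assumes the same install prefix as above:

  # Test the configuration files for syntax errors without touching the running server
  /usr/local/nginx/sbin/nginx -t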

III PHP code

Place the following code right after the first <?php tag of your site's entry file:

  //Get the UA information
  $ua = $_SERVER['HTTP_USER_AGENT'];
  //Store the malicious USER_AGENT strings in an array
  $now_ua = array('FeedDemon','BOT/0.1 (BOT for JCE)','CrawlDaddy','Java','Feedly','UniversalFeedParser','ApacheBench','Swiftbot','ZmEu','Indy Library','oBot','jaunty','YandexBot','AhrefsBot','MJ12bot','WinHttp','EasouSpider','HttpClient','Microsoft URL Control','YYSpider','jaunty','Python-urllib','lightDeckReports Bot');
  //Block empty USER_AGENT: mainstream scraping programs such as dedecms send an empty USER_AGENT, and so do some SQL injection tools
  if (!$ua) {
      header("Content-type: text/html; charset=utf-8");
      die('Please do not scrape this site!');
  } else {
      foreach ($now_ua as $value)
          //Check whether the UA matches an entry in the array
          if (eregi($value, $ua)) {
              header("Content-type: text/html; charset=utf-8");
              die('Please do not scrape this site!');
          }
  }
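One caveat: eregi() was removed in PHP 7, so the snippet above only runs as-is on old PHP 5 installations. Below is a minimal sketch of an equivalent check for PHP 7+ using stripos() (case-insensitive substring matching instead of the POSIX regex matching above); the UA list is abbreviated here for readability:

  //Sketch for PHP 7+: block requests whose User-Agent is empty or contains a banned string
  $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  $bad_ua = array('FeedDemon', 'CrawlDaddy', 'ApacheBench', 'ZmEu', 'Indy Library',
                  'AhrefsBot', 'MJ12bot', 'EasouSpider', 'HttpClient', 'Python-urllib');
  $blocked = ($ua === '');
  foreach ($bad_ua as $value) {
      if ($ua !== '' && stripos($ua, $value) !== false) {
          $blocked = true;
          break;
      }
  }
  if ($blocked) {
      header('HTTP/1.1 403 Forbidden');
      die('Please do not scrape this site!');
  }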

IV Testing the effect

If you are on a VPS, testing is simple: use curl -A to simulate crawling with different user agents, for example:

Simulate crawling by YisouSpider:

 curl -I -A 'YisouSpider' zhangge.net

Simulate crawling with an empty UA:

 curl -I -A '' zhangge.net

Simulate crawling by Baiduspider:

 curl -I -A 'Baiduspider' zhangge.net

The three requests return the following results:

The YisouSpider and empty-UA requests both return 403 Forbidden, while Baiduspider returns 200 OK, which shows that the rules are working.
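To check several user agents in one go, a small shell loop can print just the HTTP status code for each request; this is only a sketch, and zhangge.net should be replaced with your own domain:

  # Print the status code returned for each UA (blocked ones should show 403)
  for ua in 'YisouSpider' '' 'Baiduspider' 'ApacheBench'; do
      echo -n "UA='$ua' -> "
      curl -s -o /dev/null -w '%{http_code}\n' -A "$ua" zhangge.net
  done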

Checking the Nginx logs the next day confirms the effect:

①. Junk scraping requests with an empty UA are intercepted.

②. Requests from the banned UAs are intercepted.

As for other spam spiders, you can analyze the site's access logs to find user-agent names you have never seen before, verify what they are, and then add them to the ban lists in the code above; a log-analysis sketch follows below.
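As a starting point for that log analysis, a one-liner like the following lists the most frequent user agents; it is a sketch that assumes the default "combined" log format and the access-log path of the install prefix used above:

  # Field 6 (splitting on double quotes) is the User-Agent in the default combined log format
  awk -F'"' '{print $6}' /usr/local/nginx/logs/access.log | sort | uniq -c | sort -rn | head -n 20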

V Appendix: UA collection

Below is a list of common spam UAs seen on the web, for reference only; additions are welcome.

 FeedDemon               Content scraping
 BOT/0.1 (BOT for JCE)   SQL injection
 CrawlDaddy              SQL injection
 Java                    Content scraping
 Jullo                   Content scraping
 Feedly                  Content scraping
 UniversalFeedParser     Content scraping
 ApacheBench             CC attack tool
 Swiftbot                Useless crawler
 YandexBot               Useless crawler
 AhrefsBot               Useless crawler
 YisouSpider             Useless crawler (acquired by UC Shenma Search; this spider can be allowed!)
 MJ12bot                 Useless crawler
 ZmEu phpmyadmin         Vulnerability scanning
 WinHttp                 Scraping / CC attack
 EasouSpider             Useless crawler
 HttpClient              TCP attack
 Microsoft URL Control   Scanning
 YYSpider                Useless crawler
 jaunty                  WordPress brute-force scanner
 oBot                    Useless crawler
 Python-urllib           Content scraping
 Indy Library            Scanning
 FlightDeckReports Bot   Useless crawler
 Linguee Bot             Useless crawler

VI References

Ask: http://www.uedsc.com/acquisition.html

Haohai: http://www.it300.com/article-15358.html

Night Sky: http://blog.slogra.com/post-135.html

PS: This post comes from Zhang Ge's blog; see the original at: https://zhangge.net/4458.html
