Vulnerability type: Information leakage
Site-building software: Other
Server type: General
Programming language: Other
Description: A robots.txt file was found on the target web site.
1. robots.txt is the first file a search engine looks at when it visits a website.
2. The robots.txt file tells spider programs which files on the server may be viewed and which may not. A simple example: when a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the spider determines its crawling scope from the file's contents; if the file does not exist, search spiders can access every page on the site that is not password-protected. At the same time, robots.txt is publicly readable by anyone, so a malicious attacker can obtain sensitive directory and file paths simply by reading its contents.
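To make the attack concrete, the path-harvesting step can be sketched as a few lines of Python. The sample robots.txt content and the paths in it are hypothetical, purely for illustration:

```python
# Minimal sketch: extract every path a robots.txt file exposes.
# The sample content below is hypothetical, not from a real site.
sample = """User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /manager/login.php
"""

def disallowed_paths(robots_txt: str) -> list[str]:
    """Return each path listed after a Disallow directive."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "allow everything"
                paths.append(path)
    return paths

print(disallowed_paths(sample))
```

Every path this returns is a hint an attacker can probe directly, which is exactly why the file must not contain sensitive entries.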
Harm: The robots.txt file may reveal sensitive information about the system, such as the address of an administration backend or other addresses the site does not want made public. A malicious attacker can use this information to mount further attacks.
Solution: 1. Make sure robots.txt contains no sensitive information. For directories or files you do not want exposed to the public, it is recommended to use access control instead, so that anonymous users cannot reach them.
2. Move sensitive files and directories into a separate, isolated subdirectory and exclude that single directory from web robot crawling. A good approach is to move the files under a directory with a non-descriptive name such as "folder":
New directory structure:
/folder/passwords.txt
/folder/sensitive_folder/
New robots.txt:
User-agent: *
Disallow: /folder/
3. If you cannot change the directory structure and must exclude specific directories from web robot crawling, list only partial names in the robots.txt file. This is not the best solution, but it at least makes guessing the full directory names harder. For example, to exclude "admin" and "manager", use the following entries (assuming no other file or directory in the web root starts with the same characters):
robots.txt:
User-agent: *
Disallow: /ad
Disallow: /ma
Original address: http://webscan.360.cn/vul/view/vulid/139
-
My rough understanding is that robots.txt can disclose the backend or other sensitive addresses of a website. When an address I don't want people to know about ends up in robots.txt, I also use the third solution above and write only partial strings. But these are all pure obscurity tricks: a sharp-eyed visitor can easily tell whether a blog runs WordPress or some other site-building software, so there is no real way to hide a sensitive directory, and hiding it is largely pointless anyway. Still, I'm not happy seeing the scan score come back at less than 100 points, so I'll bury my head in the sand and "solve" the problem!

My idea is simple: any request for robots.txt that does not come from a spider gets a 403. In other words, robots.txt is served only to spiders. The implementation is trivial; add the following to the Nginx configuration:

    # If robots.txt is requested and the User-Agent does not match a spider, return 403
    location = /robots.txt {
        if ($http_user_agent !~* "spider|bot|Python-urllib|pycurl") {
            return 403;
        }
    }
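The Nginx condition above is just a case-insensitive regex test on the User-Agent header. The same check can be mirrored in a few lines of Python to verify which clients would get the file and which would be rejected (the `robots_status` helper and the sample User-Agent strings are mine, for illustration):

```python
import re

# Mirror of the Nginx `!~*` condition: case-insensitive match on the UA.
SPIDER_RE = re.compile(r"spider|bot|Python-urllib|pycurl", re.IGNORECASE)

def robots_status(user_agent: str) -> int:
    """Return the HTTP status /robots.txt would answer with."""
    return 200 if SPIDER_RE.search(user_agent) else 403

print(robots_status("Baiduspider/2.0"))  # a spider UA is allowed through
print(robots_status("Mozilla/5.0"))      # an ordinary browser UA is rejected
```

Note that this only filters on a self-reported header: anyone can set their User-Agent to "spider" and read the file anyway, which is consistent with the head-in-the-sand spirit of the fix.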