
How to crawl data with a Python crawler (solved in 6 steps)

How do you crawl data with a Python crawler? It is actually quite simple: master the following six steps and there is nothing complicated about it. I used to think crawlers were difficult, but once I got started I found they are quite easy. If you don't believe me, try it yourself.

Step 1: Install the requests library and Beautiful Soup library

The two libraries are imported in the program as follows:

import requests
from bs4 import BeautifulSoup

Because I write Python in PyCharm, I will explain how to install these two libraries there. Open File > Settings from the main menu, then go to Project Interpreter. Click the + sign above the package list, search for the two packages (requests and beautifulsoup4), and install them. Readers whose IDE offers this kind of package installation should find it the easiest route.
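
If you want to confirm the installation worked, a minimal check like the following can be run (the pip command in the comment is the usual command-line alternative to installing through PyCharm):

# Quick sanity check that both libraries are importable.
# If either import fails, install them first, for example with:
#   python -m pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

print("requests version:", requests.__version__)
# Parse a tiny HTML snippet to confirm Beautiful Soup works.
print(BeautifulSoup("<p>hello</p>", "html.parser").p.text)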

Step 2: Get the header and cookie required by the crawler

I wrote a crawler that scrapes Weibo's trending searches, so let's use it as the example. Getting the headers and cookies is essential for a crawler: they directly determine whether the crawler can fetch the page it is supposed to crawl.

First, open the Weibo trending search page and press F12; the browser's developer tools will open. Go to the Network tab, then press Ctrl+R to refresh the page (if requests are already listed there is no need to refresh, although refreshing does no harm). Browse the Name column, find the request for the page we want to crawl, right-click it, choose Copy, and copy the request as a cURL command.

After copying it, open the web page "Convert curl commands to code". This page automatically generates the headers and cookies from what you copied, and the generated headers and cookies can be pasted directly into the program.

# Crawler header data
cookies = {
    'SINAGLOBAL': '6797875236621.702.1603159218040',
    'SUB': '_2AkMXbqMSf8NxqwJRmfkTzmnhboh1ygvEieKhMlLJJRMxHRl-yT9jqmg8tRB6PO6N_Rc_2FhPeZF2iThYO9DfkLUGpv4V',
    'SUBP': '0033WrSXqPxfM72-Ws9jqgMF55529P9D9Wh-nU-QNDs1Fu27p6nmwwiJ',
    '_s_tentry': 'www.baidu.com',
    'UOR': 'www.hfut.edu.cn,widget.weibo.com,www.baidu.com',
    'Apache': '7782025452543.054.1635925669528',
    'ULV': '1635925669554:15:1:1:7782025452543.054.1635925669528:1627316870256',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/25',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
params = (
    ('cate', 'realtimehot'),
)

Copied into the program, it looks like the above. These are the request headers and cookies for the Weibo trending search page.

Step 3: Get the web page

Once we have the headers and cookies, we copy them into our program and use requests.get() to fetch the web page.

# Get the web page
response = requests.get('https://s.weibo.com/top/summary', headers=headers, params=params, cookies=cookies)
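
Before parsing, it is worth confirming that the request actually succeeded. A minimal sketch; the status code and page length you see depend on your own cookies and on what Weibo returns:

# Confirm the request succeeded before moving on.
response.raise_for_status()          # raises an error for 4xx/5xx responses
response.encoding = 'utf-8'          # the page is served as UTF-8
print(response.status_code, len(response.text))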

Step 4: Parse the web page

Now we go back to the website. Press F12 again and open the Elements tab of the developer tools. Use the small arrow icon in the upper left corner to click a part of the page, and the corresponding HTML code is highlighted on the right.

After locating the HTML for the part of the page we want to crawl, hover over that code, right-click it, and choose Copy > Copy selector.
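
On the program side, this step corresponds to turning the downloaded page into a Beautiful Soup object, which the copied selector will later be run against. These are the same parsing lines that appear in the full program at the end:

# Parse the web page into a Beautiful Soup object
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# The selector copied from the Elements tab can now be passed to soup.select()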

Step 5: Analyze the information obtained and simplify the address

The selector we just copied is essentially the address of the corresponding part of the web page. Since we want a whole category of information on the page, we need to analyze and generalize the address we obtained. Using the copied address as-is is not wrong either; it just means you only get the single element you clicked on.

#pl_top_realtimehot > table > tbody > tr:nth-child(1) > td.td-02 > a
#pl_top_realtimehot > table > tbody > tr:nth-child(2) > td.td-02 > a
#pl_top_realtimehot > table > tbody > tr:nth-child(9) > td.td-02 > a

These are three of the addresses I obtained. They are almost identical; the only difference is the number inside the tr:nth-child(...) part. tr is the HTML tag for a table row, and the :nth-child(n) qualifier narrows the selection to one specific row, so this kind of information is evidently stored row by row under tr. If we drop the qualifier and select from tr directly, we get every entry of this kind at once. The simplified address is therefore:

#pl_top_realtimehot > table > tbody > tr > td.td-02 > a

Readers with some knowledge of HTML and CSS selectors will find this step easier, but it does not matter if you have no such background. The key idea is to keep the parts that are the same and drop the parts that differ. Experiment a little and you will get it right.
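
To see the difference, you can run both the copied and the simplified selector against the parsed page. This is only an illustrative sketch, assuming soup was built as in the previous step; the exact number of matches depends on the live page:

# The copied selector matches a single entry...
one = soup.select('#pl_top_realtimehot > table > tbody > tr:nth-child(1) > td.td-02 > a')
# ...while the simplified selector matches every entry in the list.
all_entries = soup.select('#pl_top_realtimehot > table > tbody > tr > td.td-02 > a')
print(len(one), len(all_entries))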

Step 6: Crawl the content and clean the data

Once this is done, we can crawl the data. Store the simplified address in a variable; passing it to soup.select() pulls out exactly the page content we want.

# Crawl content
content = "#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"

We then filter out what we do not need, such as the surrounding HTML markup, by calling .text on each element returned by soup.select(), so that only the readable text is left for the audience. With that, we have successfully extracted the information.

fo = open("./Weibo hot search.txt", 'a', encoding="utf-8")
a = soup.select(content)
for i in range(0, len(a)):
    a[i] = a[i].text
    fo.write(a[i] + '\n')
fo.close()

I stored the data in a text file, hence the write operations performed by write(). Where to store the data, and how to use it afterwards, is up to the reader.
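
For instance, if you prefer structured output, the same results can be written to a CSV file instead of a plain text file. This is only a sketch of one possible alternative, assuming soup and content are defined as above; the "rank" and "title" column names are my own choice and not part of the original program:

import csv

# Write each trending entry as a numbered row in a CSV file.
# (The column names "rank" and "title" are illustrative choices.)
with open("./Weibo hot search.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "title"])
    for rank, item in enumerate(soup.select(content), start=1):
        writer.writerow([rank, item.text])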

The complete example code for crawling the Weibo trending searches:

import os
import requests
from bs4 import BeautifulSoup

# Crawler header data
cookies = {
    'SINAGLOBAL': '6797875236621.702.1603159218040',
    'SUB': '_2AkMXbqMSf8NxqwJRmfkTzmnhboh1ygvEieKhMlLJJRMxHRl-yT9jqmg8tRB6PO6N_Rc_2FhPeZF2iThYO9DfkLUGpv4V',
    'SUBP': '0033WrSXqPxfM72-Ws9jqgMF55529P9D9Wh-nU-QNDs1Fu27p6nmwwiJ',
    '_s_tentry': 'www.baidu.com',
    'UOR': 'www.hfut.edu.cn,widget.weibo.com,www.baidu.com',
    'Apache': '7782025452543.054.1635925669528',
    'ULV': '1635925669554:15:1:1:7782025452543.054.1635925669528:1627316870256',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/25',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
params = (
    ('cate', 'realtimehot'),
)
# Data storage
fo = open("./Weibo hot search.txt", 'a', encoding="utf-8")
# Get the web page
response = requests.get('https://s.weibo.com/top/summary', headers=headers, params=params, cookies=cookies)
# Parse the web page
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# Crawl content
content = "#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"
# Clean the data and write it to the file
a = soup.select(content)
for i in range(0, len(a)):
    a[i] = a[i].text
    fo.write(a[i] + '\n')
fo.close()

That is the complete method for crawling data with a Python crawler.
