http://note.youdao.com/noteshare?id=a3a533247e4c084a72c9ae88c271e3d1
Python version: 3.5
IDE: PyCharm 5.0.4
The packages to be used can be installed from within PyCharm:
File -> Default Settings -> Default Project -> Project Interpreter
Select the Python version, then click the plus sign on the right to install the packages you need.
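They can also be installed from the command line with pip install requests beautifulsoup4. A quick sanity check (a small sketch, assuming the packages installed correctly) is to import them and print their versions:

import requests
import bs4  # beautifulsoup4 is imported as bs4

print(requests.__version__)
print(bs4.__version__)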
The page I chose is the Suzhou weather forecast on the China Weather Network. I am going to grab the 7-day weather conditions and the highest/lowest temperatures.
http://www.weather.com.cn/weather/101190401.shtml
At the beginning of the program we added:
# coding: utf-8
This tells the interpreter that the .py file is UTF-8 encoded, so the source code can contain Chinese characters.
Packages to import:
import requests
import csv
import random
import time
import socket
import http.client
# import urllib.request
from bs4 import BeautifulSoup
requests: fetches the HTML source of a web page
csv: writes data to a CSV file
random: picks random numbers
time: time-related operations (sleeping between retries)
socket and http.client: used here only for exception handling
BeautifulSoup: used instead of regular expressions to extract the contents of the corresponding tags from the source code
urllib.request: another way to fetch the HTML source of a web page (the one I started with), but not as convenient as requests
Get the HTML source of the web page:
def get_content(url, data=None):
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }
    timeout = random.choice(range(80, 180))
    while True:
        try:
            rep = requests.get(url, headers=header, timeout=timeout)
            rep.encoding = 'utf-8'
            # req = urllib.request.Request(url, data, header)
            # response = urllib.request.urlopen(req, timeout=timeout)
            # html1 = response.read().decode('UTF-8', errors='ignore')
            # response.close()
            break
        # except urllib.request.HTTPError as e:
        #     print('1:', e)
        #     time.sleep(random.choice(range(5, 10)))
        #
        # except urllib.request.URLError as e:
        #     print('2:', e)
        #     time.sleep(random.choice(range(5, 10)))
        except socket.timeout as e:
            print('3:', e)
            time.sleep(random.choice(range(8, 15)))
        except socket.error as e:
            print('4:', e)
            time.sleep(random.choice(range(20, 60)))
        except http.client.BadStatusLine as e:
            print('5:', e)
            time.sleep(random.choice(range(30, 80)))
        except http.client.IncompleteRead as e:
            print('6:', e)
            time.sleep(random.choice(range(5, 15)))
    return rep.text
    # return html_text
header is a parameter of requests.get, used to simulate a browser visit.
The header can be obtained with Chrome's developer tools as follows:
Open Chrome, press F12 and select Network.
Visit the website again, find the first network request and look at its headers.
timeout is the request timeout; a random value is chosen to make it harder for the website to recognize the program as a crawler.
Then requests.get is used to fetch the source code of the web page, and
rep.encoding = 'utf-8' changes the encoding of the response to UTF-8, so that the Chinese parts of the source are not garbled.
After that come a few exception handlers, and finally rep.text is returned.
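As a quick check (a minimal sketch, assuming get_content is defined as above), the function can be called directly and part of the returned source printed:

page = get_content('http://www.weather.com.cn/weather/101190401.shtml')
print(page[:300])  # first 300 characters of the page source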
Extract the fields we need from the HTML:
Here we mainly use BeautifulSoup
BeautifulSoup Document http://www.crummy.com/software/BeautifulSoup/bs4/doc/
First, use the developer tools to look at the source code of the page and find the appropriate location for the required fields
The fields we need are all inside the ul of the div with id="7d": the date is in the h1 of each li, the weather condition is in the first p tag of each li, and the maximum and minimum temperatures are in the span and i tags of each li.
Thanks to Joey_Ko for pointing out that in the evening there may be no maximum temperature, so a check is needed.
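To make that tag layout concrete, here is a small hand-written HTML fragment (not the real page source, just an imitation of the structure described above) and how BeautifulSoup pulls the fields out of it:

from bs4 import BeautifulSoup

# Hand-written fragment imitating the structure described above (not the real page)
snippet = '''
<div id="7d"><ul>
  <li><h1>18日（今天）</h1>
      <p class="wea">多云</p>
      <p class="tem"><span>20</span>/<i>12℃</i></p></li>
</ul></div>
'''
soup = BeautifulSoup(snippet, 'html.parser')
li = soup.find('div', {'id': '7d'}).find('ul').find('li')
print(li.find('h1').string)                     # date
print(li.find_all('p')[0].string)               # weather condition
print(li.find_all('p')[1].find('span').string)  # highest temperature
print(li.find_all('p')[1].find('i').string)     # lowest temperature, e.g. '12℃'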
The full extraction function is as follows:
def get_data(html_text):
    final = []
    bs = BeautifulSoup(html_text, "html.parser")  # Create the BeautifulSoup object
    body = bs.body                                # Get the body part
    data = body.find('div', {'id': '7d'})         # Find the div with id 7d
    ul = data.find('ul')                          # Get the ul part
    li = ul.find_all('li')                        # Get all the li tags
    for day in li:                                # Traverse the contents of each li tag
        temp = []
        date = day.find('h1').string              # Find the date
        temp.append(date)                         # Add it to temp
        inf = day.find_all('p')                   # Find all p tags in the li
        temp.append(inf[0].string)                # Add the content of the first p tag (weather condition) to temp
        if inf[1].find('span') is None:
            temperature_highest = None            # The forecast may not include the day's highest temperature (that is the case in the evening), so a check is needed
        else:
            temperature_highest = inf[1].find('span').string             # Find the highest temperature
            temperature_highest = temperature_highest.replace('℃', '')   # In the evening the page changes and a ℃ symbol follows the highest temperature; remove it
        temperature_lowest = inf[1].find('i').string                     # Find the lowest temperature
        temperature_lowest = temperature_lowest.replace('℃', '')         # The lowest temperature is followed by a ℃ symbol; remove it
        temp.append(temperature_highest)          # Add the highest temperature to temp
        temp.append(temperature_lowest)           # Add the lowest temperature to temp
        final.append(temp)                        # Add temp to final
    return final
Write to a CSV file:
After the data is fetched, we write it to a file with the following code:
def write_data(data, name):
    file_name = name
    with open(file_name, 'a', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)
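Note that the file is opened in append mode ('a'), so running the script several times keeps adding rows to the same file. A quick test with hand-made rows (not real scraped data) could look like this:

sample = [['18日（今天）', '多云', '20', '12'],
          ['19日（明天）', '晴', '22', '13']]
write_data(sample, 'test.csv')  # appends the two rows to test.csv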
Main function:
if __name__ == '__main__':
    url = 'http://www.weather.com.cn/weather/101190401.shtml'
    html = get_content(url)
    result = get_data(html)
    write_data(result, 'weather.csv')
Then run it.
The generated weather.csv file contains one row per day, with the date, weather condition, highest temperature and lowest temperature.
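Since the output is plain CSV, one way to inspect it (a small sketch, assuming weather.csv was produced by the script above) is to read it back with the same csv module:

import csv

with open('weather.csv', errors='ignore') as f:
    for row in csv.reader(f):
        print(row)  # e.g. ['18日（今天）', '多云', '20', '12']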
To summarize, there are roughly 3 steps to grab content from a web page:
1. Simulate browser access to get html source code
2. Extract the contents of the specified tags (here with BeautifulSoup; regular-expression matching is an alternative)
3. Write the obtained content to a file
I am new to Python crawlers, so there may be mistakes in my understanding. Criticism and corrections are welcome, thank you!