Python Crawler for Beginners (12): Basic Use of urllib (2)

Life is short, I use Python.

Previous portal:

Python Crawler for Beginners (1): The Beginning

Python Crawler for Beginners (2): Preparation (1), Installing the Basic Libraries

Python Crawler for Beginners (3): Preparation (2), Linux Basics

Python Crawler for Beginners (4): Preparation (3), Docker Basics

Python Crawler for Beginners (5): Preparation (4), Database Basics

Python Crawler for Beginners (6): Preparation (5), Installing the Crawler Frameworks

Python Crawler for Beginners (7): HTTP Basics

Python Crawler for Beginners (8): Web Page Basics

Python Crawler for Beginners (9): Crawler Basics

Python Crawler for Beginners (10): Sessions and Cookies

Python Crawler for Beginners (11): Basic Use of urllib (1)

Introduction

In the last article we covered the basic usage of urlopen(), but those few simple parameters are not enough to build a complete request. For more complex requests, adding request headers for instance, urlopen() alone is not up to the job. This is where Request comes in.

Request

Official documentation: https://docs.python.org/zh-cn…

First, let’s look at the constructor signature of Request:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
  • url: the address to request. This is the only required parameter; the rest are optional.
  • data: if this parameter is passed, it must be of type bytes.
  • headers: the request headers, a dictionary. Headers can be supplied here when constructing the Request, or added afterwards by calling add_header() (a short sketch follows this list).
  • origin_req_host: the host name or IP address of the party making the request.
  • unverifiable: indicates whether the request is unverifiable; the default is False. It means the user had no chance to approve the request. For example, when we request an image embedded in an HTML document but have no option to approve fetching that image automatically, unverifiable is True.
  • method: the request method, such as GET, POST, PUT, DELETE, etc.
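
For the add_header() route mentioned above, here is a minimal sketch; the test URL and the User-Agent value are only placeholders:

import urllib.request

# Build the Request first, then attach a header with add_header()
req = urllib.request.Request('https://httpbin.org/get')
req.add_header('User-Agent', 'my-crawler/0.1')
response = urllib.request.urlopen(req)
print(response.getcode())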

Let’s start with a simple example and use Request to crawl the author’s blog:

import urllib.request

request = urllib.request.Request('https://www.geekdigging.com/')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

As you can see, we still use urlopen() to send the request, but instead of passing the URL, data, timeout and other arguments directly, the parameter is now a Request object.
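
One small note: timeout is not a Request attribute, so if you need it you still pass it to urlopen() together with the Request object. A minimal sketch (the 10-second value is arbitrary):

import urllib.request

req = urllib.request.Request('https://www.geekdigging.com/')
# timeout is still an argument of urlopen(), not of Request
response = urllib.request.urlopen(req, timeout=10)
print(len(response.read()))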

Let’s build a slightly more complex request.

import urllib.request
import json

url = 'https://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'Content-Type': 'application/json;encoding=utf-8',
    'Host': 'geekdigging.com'
}
data = {
    'name': 'geekdigging',
    'hello':'world'
}
data = bytes(json.dumps(data), encoding='utf8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
resp = urllib.request.urlopen(req)
print(resp.read().decode('utf-8'))

The results are as follows:

{
  "args": {}, 
  "data": "{\"name\": \"geekdigging\", \"hello\": \"world\"}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "41", 
    "Content-Type": "application/json;encoding=utf-8", 
    "Host": "geekdigging.com", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
  }, 
  "json": {
    "hello": "world", 
    "name": "geekdigging"
  }, 
  "origin": "116.234.254.11, 116.234.254.11", 
  "url": "https://geekdigging.com/post"
}

Here we build a Request object.

The url specifies the link to visit, which is still the test link mentioned in the previous article.

In headers we specify three fields: User-Agent, Content-Type and Host.

For data, we use json.dumps() to convert a dict into a JSON string, and then use bytes() to turn it into a byte stream.

Finally, the request method is specified as POST.

From the final result, we can see that all of our settings took effect.
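
As a side note, if the server expects an ordinary form submission rather than a JSON body, the same data parameter can be built with urllib.parse.urlencode(); here is a minimal sketch against the same httpbin test endpoint:

import urllib.request, urllib.parse

url = 'https://httpbin.org/post'
form = {'name': 'geekdigging', 'hello': 'world'}
# urlencode() produces "name=geekdigging&hello=world"; encode() turns it into bytes
data = urllib.parse.urlencode(form).encode('utf-8')
req = urllib.request.Request(url=url, data=data, method='POST')
resp = urllib.request.urlopen(req)
print(resp.read().decode('utf-8'))

This time httpbin echoes the payload back under the form field instead of json.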

Advanced operations

Earlier we used Request to add request headers. If we also want to handle cookies or access pages through a proxy, we need something more powerful: handlers. A handler can be simply understood as a processor for one specific piece of functionality; with handlers, almost everything an HTTP request might need can be done for us. (A small proxy sketch appears near the end of this article.)

urllib.request provides the BaseHandler class for us. It is the parent class of all other handlers, and it offers the following methods and attributes for direct use:

  • add_parent(director): registers an OpenerDirector as the parent of this handler.
  • close(): removes the parent reference.
  • parent: the parent OpenerDirector, through which a different protocol can be opened or errors can be handled.
  • default_open(): a catch-all for every URL; it is called before the protocol-specific open methods.

Next, there are various handler subclasses that inherit from BaseHandler:

  • HTTPDefaultErrorHandler: handles HTTP response errors by raising an HTTPError exception.
  • HTTPRedirectHandler: handles redirects.
  • ProxyHandler: sets a proxy for requests; the default proxy is empty.
  • HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
  • AbstractBasicAuthHandler: handles authentication by retrieving the user/password pair and retrying the request.
  • HTTPBasicAuthHandler: retries a request with basic authentication information.
  • HTTPCookieProcessor: handles cookies.

There are many more BaseHandler subclasses provided by urllib; I won't list them all here. You can find them in the official documentation.

Official documentation: https://docs.python.org/zh-cn…

Before introducing how to use handlers, let me introduce a higher-level class: OpenerDirector.

OpenerDirector is a high-level class for opening URLs. It opens a URL in three stages:

Within each stage, the order in which the methods are called is determined by sorting the handler instances. First, every handler with a method named like protocol_request() has that method called to pre-process the request; then handlers with a method named like protocol_open() are called to handle the request; finally, every handler with a method named like protocol_response() has that method called to post-process the response.

We usually call an OpenerDirector instance an opener. The urlopen() method we used before is, in fact, an opener provided by urllib.

An opener’s methods include:

  • add_handler(handler): adds a handler to the chain.
  • open(url, data=None[, timeout]): opens the given URL, just like the urlopen() method.
  • error(proto, *args): handles errors for the given protocol.
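
Since urlopen() is itself just a ready-made opener, we can also build our own with build_opener(), which returns an OpenerDirector, and optionally register it globally with install_opener() so that later urlopen() calls go through it. A minimal sketch with no extra handlers (the test URL is only a placeholder):

import urllib.request

# build_opener() returns an OpenerDirector with the default handlers installed
opener = urllib.request.build_opener()
response = opener.open('https://httpbin.org/get', timeout=10)
print(response.getcode())

# install_opener() makes this opener the one used by urlopen() from now on
urllib.request.install_opener(opener)
response = urllib.request.urlopen('https://httpbin.org/get')
print(response.getcode())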

Let’s demonstrate how to get the cookies of a website:

import http.cookiejar, urllib.request

# Instantiate a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build a handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build the opener
opener = urllib.request.build_opener(handler)
# Send the request
response = opener.open('https://www.baidu.com/')
# Print the whole CookieJar, then each cookie's name and value
print(cookie)
for item in cookie:
    print(item.name + " = " + item.value)

I won't explain the code line by line here; the comments already cover each step. The final printed results are as follows:

<CookieJar[<Cookie BAIDUID=48EA1A60922D7A30F711A420D3C5BA22:FG=1 for .baidu.com/>, <Cookie BIDUPSID=48EA1A60922D7A30DA2E4CBE7B81D738 for .baidu.com/>, <Cookie PSTM=1575167484 for .baidu.com/>, <Cookie BD_NOT_HTTPS=1 for www.baidu.com/>]>
BAIDUID = 48EA1A60922D7A30F711A420D3C5BA22:FG=1
BIDUPSID = 48EA1A60922D7A30DA2E4CBE7B81D738
PSTM = 1575167484
BD_NOT_HTTPS = 1

A question arises here: since the cookies can be printed, can we also save them to a file?

The answer, of course, is yes, because cookies themselves are usually stored in files anyway.

import http.cookiejar, urllib.request

# Example: saving cookies to a Mozilla-format file
filename = 'cookies_mozilla.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
print('cookies_mozilla.txt saved successfully')

Here we replace the CookieJar used earlier with MozillaCookieJar, a subclass of CookieJar. It handles cookies and file-related events, such as reading and saving cookies, and it saves cookies in the Mozilla-browser cookie file format.

After running the code, we can see that a cookies_mozilla.txt file has been generated, with the following contents:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com    TRUE    /    FALSE    1606703804    BAIDUID    0A7A76A3705A730B35A559B601425953:FG=1
.baidu.com    TRUE    /    FALSE    3722651451    BIDUPSID    0A7A76A3705A730BE64A1F6D826869B5
.baidu.com    TRUE    /    FALSE        H_PS_PSSID    1461_21102_30211_30125_26350_30239
.baidu.com    TRUE    /    FALSE    3722651451    PSTM    1575167805
.baidu.com    TRUE    /    FALSE        delPer    0
www.baidu.com    FALSE    /    FALSE        BDSVRTM    0
www.baidu.com    FALSE    /    FALSE        BD_HOME    0

I'm being lazy here, so instead of a screenshot I'm pasting the results directly.

Of course, besides saving cookies in the Mozilla-browser format, we can also save them in the libwww-perl (LWP) format.

To save the cookie file in LWP format, change the declaration to LWPCookieJar:

import http.cookiejar, urllib.request

# Example: saving cookies to an LWP-format file
filename = 'cookies_lwp.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
print('cookies_lwp.txt saved successfully')

The results are as follows:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="D634D45523004545C6E23691E7CE3894:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2020-11-30 02:45:24Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=D634D455230045458E6056651566B7E3; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-12-19 05:59:31Z"; version=0
Set-Cookie3: H_PS_PSSID=1427_21095_30210_18560_30125; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1575168325; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-12-19 05:59:31Z"; version=0
Set-Cookie3: delPer=0; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

As you can see, the difference between the two types of cookie file formats is very large.

Now that the cookie files have been generated, the next step is to use the cookies when sending a request. The example code is as follows:

import http.cookiejar, urllib.request

# Use the Mozilla-format cookie file when sending the request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies_mozilla.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Here we use the load() method to read the local cookie file and obtain the cookie contents.

The prerequisite is that we have generated the Mozilla-format cookie file in advance; we then load the cookies and build the handler and opener in the same way as before.

When the request succeeds, you will get the source code of the Baidu home page. I won't paste the result here, as it is rather long.
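
At the beginning of this part I also mentioned proxy access. A ProxyHandler plugs into build_opener() in exactly the same way as HTTPCookieProcessor; here is a minimal sketch, assuming a hypothetical local HTTP proxy listening on 127.0.0.1:8080 (replace it with a proxy you actually have before running):

import urllib.request

# 127.0.0.1:8080 is only a placeholder; point this at a real proxy
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('https://httpbin.org/get')
print(response.read().decode('utf-8'))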

That's the end of this article. I hope you remember to type out the code yourself~~~

Sample code

All the code in this series is available on GitHub and Gitee for easy access.

Sample code - GitHub

Sample code - Gitee

References

https://www.cnblogs.com/zhang…

https://cuiqingcai.com/5500.html


If this article helped you, please scan the QR code and follow the author's official account to get the latest posts: