How to Use the Requests Library for Python Web Scraping

This article explains how to use the Requests library for Python web scraping. The methods covered here are simple, quick, and practical, so let's walk through them step by step.

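1. Installing the requests library

This tutorial uses Python, so install Python first. I used Python 3.8; you can check your installed version with python --version. Any Python 3.x release is recommended.

Once Python is installed, install the requests library with the following command.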

pip install requests

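PS: switching to a pip mirror inside China, such as the Alibaba or Douban mirror, makes downloads faster.
To demonstrate the features, I used nginx to simulate a simple website.
After downloading nginx, just run nginx.exe in its root directory (note: this is on Windows).
Visiting http://127.0.0.1 on the local machine then shows nginx's default page.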

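2. Fetching a web page

Now let's use requests to send a request and fetch the page source.

import requests

r = requests.get('http://127.0.0.1')
print(r.text)

Running it prints the page source. With the HTML tags omitted, the page content reads: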

Welcome to nginx!

If you see this page, the nginx web server is successfully installed and working. Further configuration is required.

For online documentation and support please refer to nginx.org.
Commercial support is available at nginx.com.

Thank you for using nginx.

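3. About request methods

There are many kinds of HTTP requests; the example above used a GET request. The following sections cover the common request methods in detail.

4. GET requests

4.1. Sending a request

Using the same approach, we send a GET request:

import requests

r = requests.get('http://httpbin.org/get')
print(r.text)

The response is as follows: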

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.23.0",
    "X-Amzn-Trace-Id": "Root=1-5f846520-19f215aa46213a2b4241c18a"
  },
  "origin": "xxxx",
  "url": "http://httpbin.org/get"
}

The response shows that it includes the request headers, the URL, the client IP (origin), and other information.

4.2. Adding parameters

URLs often carry parameters, for example id is 100 and name is YOOAO. Written by hand, the URL looks like this:

http://httpbin.org/get?id=100&name=YOOAO

This is inconvenient and error-prone when there are many parameters. Instead, we can pass them through the params argument.

import requests

data = {
    'id': '100',
    'name': 'YOOAO'
}
r = requests.get('http://httpbin.org/get', params=data)
print(r.text)

Running the code returns the following:

{
  "args": {
    "id": "100",
    "name": "YOOAO"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.23.0",
    "X-Amzn-Trace-Id": "Root=1-5f84658a-1cd0437b4cf34835410d7161"
  },
  "origin": "xxx.xxxx.xxx.xxx",
  "url": "http://httpbin.org/get?id=100&name=YOOAO"
}

The result shows that the parameters passed as a dictionary were assembled into the full URL automatically, so we do not have to build it by hand.
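To see what the params argument does without sending anything over the network, one option is to build the request and inspect the prepared URL. This is only a minimal sketch; the tag parameter and its values are made up for illustration.

```python
import requests

data = {'id': '100', 'name': 'YOOAO'}

# Build the request without sending it, then look at the final URL.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
print(prepared.url)  # http://httpbin.org/get?id=100&name=YOOAO

# A list value becomes a repeated parameter (hypothetical "tag" parameter).
multi = requests.Request('GET', 'http://httpbin.org/get', params={'tag': ['a', 'b']}).prepare()
print(multi.url)  # http://httpbin.org/get?tag=a&tag=b
```

Inspecting the prepared request this way is handy when debugging a query string before pointing the scraper at a real site.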

4.3. Handling the response

The response body is JSON, so we can call the json() method to parse it into a Python object. If the body is not valid JSON, this call raises an exception.

import requests

r = requests.get('http://httpbin.org/get')
print(type(r.text))
print(type(r.json()))

Result:

<class 'str'>
<class 'dict'>
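Because json() fails on a non-JSON body, a scraper usually wants to guard the call. The sketch below assumes the exception raised is a ValueError subclass, which is my understanding of current requests releases. FakeResponse and json_or_none are hypothetical names, and FakeResponse only stands in for a real response so the example runs offline.

```python
import json


class FakeResponse:
    """Hypothetical stand-in for requests.Response, for illustration only."""

    def __init__(self, text):
        self.text = text

    def json(self):
        # Raises json.JSONDecodeError (a ValueError) on invalid input.
        return json.loads(self.text)


def json_or_none(response):
    """Return the parsed JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        return None


print(json_or_none(FakeResponse('{"url": "http://httpbin.org/get"}')))
print(json_or_none(FakeResponse('<html>not json</html>')))
```

With a real response, json_or_none(requests.get(url)) would be called the same way.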

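4.4. Extracting content

Here we use a simple regular expression to extract the content of every <a> tag in the nginx sample page:

import re
import requests

r = requests.get('http://127.0.0.1')
# Capture the text between each opening <a ...> tag and its closing </a>
pattern = re.compile(r'<a.*?>(.*?)</a>', re.S)
a_content = re.findall(pattern, r.text)
print(a_content)

Result: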

['nginx.org', 'nginx.com']

That completes a simple page fetch and content extraction.
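Regular expressions work for a page this small, but they get brittle on real-world HTML. As an alternative sketch (not the method used above), the standard library's html.parser can collect both the link target and the link text. The LinkCollector class and the sample_html snippet are made up for illustration; the snippet only resembles the nginx welcome page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []
        self._inside_a = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._inside_a = True
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>.
        if self._inside_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._inside_a:
            self.links.append((self._href, ''.join(self._text)))
            self._inside_a = False


# Made-up snippet resembling the links on the nginx welcome page.
sample_html = (
    '<p>Refer to <a href="http://nginx.org/">nginx.org</a>.<br/>\n'
    'Support is at <a href="http://nginx.com/">nginx.com</a>.</p>'
)

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

In a scraper, collector.feed(r.text) would take the place of the hard-coded snippet.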

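4.5. Downloading binary files

The examples above all returned page text. To fetch images, audio, or video from a page, we need to work with the response's binary data. We can use open to save an image or other binary file. Example code:

import requests

r = requests.get('http://tu.ossfiles.cn:9186/group3/M00/09/FB/rBpVfl8QFLOAYhhcAAC-pTdNj7g471.jpg')
# r.content holds the raw bytes of the response body
with open('image.jpg', 'wb') as f:
    f.write(r.content)
print('Download complete')

In the call to open, the first argument is the file name, and the second, 'wb', opens the file for writing in binary mode so that binary data can be written to it.

When the script finishes, the downloaded image is saved in the directory the script was run from. The same approach works for video and audio files.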
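For large video or audio files, reading everything into r.content keeps the whole file in memory. A common alternative is to stream the response and write it chunk by chunk. The sketch below shows only the writing side so that it runs offline; save_chunks is a hypothetical helper, the byte chunks are made up, and the commented lines show how it might be wired to requests with stream=True and iter_content.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the byte count."""
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


# With requests, the chunks would come from a streamed response, e.g.:
#   with requests.get(url, stream=True) as r:
#       save_chunks(r.iter_content(chunk_size=8192), 'video.mp4')

# Offline demonstration with made-up byte chunks.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
written = save_chunks([b'\x89PNG', b'', b'\r\n\x1a\n'], demo_path)
print(written, os.path.getsize(demo_path))
```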

4.6. Adding headers

In the examples above we sent requests without adding any headers of our own. Some sites respond abnormally when a request lacks the headers they expect, so we can add headers manually. The following code sets the User-Agent header:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'
}
r = requests.get('http://httpbin.org/get', headers=headers)
print(r.text)

Result:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-5ec8f342-8a9f986011eac8f07be8b450"
  },
  "origin": "xxx3.xx.xxx.xxx",
  "url": "http://httpbin.org/get"
}

As the result shows, the User-Agent value has changed; it is no longer the earlier python-requests/2.23.0.
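When every request in a crawl should carry the same headers, setting them once on a Session avoids repeating the headers argument. The following sketch prepares a request without sending it, just to show the merged headers; the variable names are my own.

```python
import requests

ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36')

session = requests.Session()
default_ua = session.headers['User-Agent']
print(default_ua)  # the library default, python-requests/<version>

# Headers set on the session apply to every request made through it.
session.headers.update({'User-Agent': ua})

# Prepare (but do not send) a request to inspect the merged headers.
prepared = session.prepare_request(requests.Request('GET', 'http://httpbin.org/get'))
print(prepared.headers['User-Agent'])
```

Calling session.get(...) afterwards would send the custom User-Agent along with the session's other default headers.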

5. POST requests

That covers GET requests. Next is another common request method: POST.

The code for sending a POST request with requests is as follows:

import requests

data = {
    'id': '100',
    'name': 'YOOAO'
}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

The result is as follows:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "id": "100",
    "name": "YOOAO"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "17",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.23.0",
    "X-Amzn-Trace-Id": "Root=1-5ec8f4a0-affca27a05e320a84ca6535a"
  },
  "json": null,
  "origin": "xxxx",
  "url": "http://httpbin.org/post"
}

The form field contains the data we submitted, which shows that the POST request succeeded.
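The data argument sends a URL-encoded form, which is why the values appear under form and the Content-Type is application/x-www-form-urlencoded. Many APIs expect a JSON body instead, and requests has a json argument for that. The sketch below prepares both variants without sending them, so the difference can be inspected offline.

```python
import json
import requests

data = {'id': '100', 'name': 'YOOAO'}
url = 'http://httpbin.org/post'

# data= sends a URL-encoded form body.
form_req = requests.Request('POST', url, data=data).prepare()
print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # id=100&name=YOOAO

# json= serializes the dictionary as a JSON body instead.
json_req = requests.Request('POST', url, json=data).prepare()
print(json_req.headers['Content-Type'])  # application/json
print(json.loads(json_req.body))
```

Sent to httpbin, the second variant would be expected to show up under json rather than form in the response.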

6. Responses

Every request to a URL gets a response. The examples above used text and content to read the response body. Many other attributes and methods expose further information, such as the status code, response headers, and cookies.

import requests

r = requests.get('http://127.0.0.1/')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

For status codes, requests also provides a built-in lookup object, requests.codes. Usage example:

import sys
import requests

r = requests.get('http://127.0.0.1/')
# Stop if the status code is not 200 (requests.codes.ok)
if r.status_code != requests.codes.ok:
    sys.exit()
print('Request Successfully')

========== Result ==========
Request Successfully

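The code compares the returned status code with the built-in success code to confirm that the request got a normal response. If it did, a success message is printed; otherwise the program exits.

Here requests.codes.ok evaluates to the success status code 200.

This way we no longer have to write the numeric status codes in our programs; a named status code is easier to read.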
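Another way to stop on a failed request is raise_for_status(), which raises an exception for 4xx and 5xx responses instead of requiring a manual comparison. In the sketch below the response object is built by hand with an assumed status of 404 so that the example runs offline; normally it would come from requests.get.

```python
import requests

# Hypothetical response object built by hand so the sketch runs offline;
# normally it would come from requests.get(...).
r = requests.Response()
r.status_code = 404

try:
    r.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
    outcome = 'Request Successfully'
except requests.exceptions.HTTPError as exc:
    outcome = 'Request failed: {}'.format(exc)

print(outcome)
```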

The status codes and their lookup names are listed below:

# Informational status codes
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),
# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0
# Client error status codes
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),
# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')
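Any name in the table can be looked up on requests.codes, either as an attribute or with square brackets. A short sketch follows; the status variable is a made-up value standing in for r.status_code.

```python
import requests

print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404
print(requests.codes['teapot'])  # 418

# A name can be compared directly with a numeric status code.
status = 503  # made-up value standing in for r.status_code
is_unavailable = status == requests.codes.service_unavailable
print(is_unavailable)  # True
```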

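7. SSL certificate verification

Many sites are now served over HTTPS, and requests verifies their certificates by default. We can set the verify parameter to skip certificate verification.

import requests

response = requests.get('https://XXXXXXXX', verify=False)
print(response.status_code)

Alternatively, specify a local certificate to use as the client certificate:

import requests

response = requests.get('https://xxxxxx', cert=('/path/server.crt', '/path/server.key'))
print(response.status_code)

Note: the private key of a local certificate must be unencrypted; encrypted keys are not supported.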

8. Setting timeouts

We often need a timeout to keep a crawl efficient, so that slow links are skipped rather than waited on.

Example code:

import requests

# Set the timeout to 10 seconds
r = requests.get('https://httpbin.org/get', timeout=10)
print(r.status_code)

To set the connect timeout and the read timeout separately:

r = requests.get('https://httpbin.org/get', timeout=(3, 10))

If the argument is omitted, no timeout is applied by default, which is equivalent to:

r = requests.get('https://httpbin.org/get', timeout=None)
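A timeout only helps if the resulting exception is handled; otherwise the script stops at the first slow link. The sketch below catches requests.exceptions.Timeout and returns None so the crawl can move on. fetch_or_none is a hypothetical helper, and the silent local socket is just a stand-in for a slow site so the example can run without internet access.

```python
import socket
import requests


def fetch_or_none(session, url, timeout=(3, 10)):
    """Return the response, or None if the request timed out."""
    try:
        return session.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None


# Local stand-in for a slow site: a socket that accepts connections
# at the OS level but never sends a reply.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as silent_server:
    silent_server.bind(('127.0.0.1', 0))  # port 0: let the OS pick a free port
    silent_server.listen(1)
    port = silent_server.getsockname()[1]

    with requests.Session() as session:
        session.trust_env = False  # ignore any proxy environment variables
        result = fetch_or_none(session, 'http://127.0.0.1:{}/'.format(port), timeout=(1, 0.5))

print(result)
```

In a crawl loop, a None result would simply mean "skip this link and continue".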

9. Authentication

Some sites require a user name and password; we can supply them through the auth parameter.

import requests
from requests.auth import HTTPBasicAuth

# User name is admin, password is admin
r = requests.get('https://xxxxxx/', auth=HTTPBasicAuth('admin', 'admin'))
print(r.status_code)

Shorthand:

import requests

r = requests.get('https://xxxxxx', auth=('admin', 'admin'))
print(r.status_code)
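Behind both spellings, HTTP Basic authentication adds an Authorization header containing the base64 encoding of user:password. The sketch below prepares a request without sending it and checks that header. The host example.com is a placeholder of mine, not the site used above.

```python
import base64
import requests

# Prepare (but do not send) a request; example.com is a placeholder host.
prepared = requests.Request('GET', 'https://example.com/', auth=('admin', 'admin')).prepare()
header = prepared.headers['Authorization']

# Basic auth is "Basic " plus base64("user:password").
expected = 'Basic ' + base64.b64encode(b'admin:admin').decode('ascii')
print(header == expected)  # True
```

Since base64 is an encoding rather than encryption, credentials sent this way are only protected when the connection itself uses HTTPS.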

10. Setting proxies

If we access a site too frequently, its anti-scraping measures may eventually detect us and ask for a CAPTCHA or other information, or even block our IP entirely. Setting a proxy can avoid this problem.

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
requests.get("http://example.org", proxies=proxies)

If your proxy requires HTTP Basic Auth, use the http://user:password@host/ syntax:

proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}

To set a proxy for a specific scheme and host, use scheme://hostname as the key; it matches requests to that host over that scheme.

proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}
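To check which proxy a given URL would use, one option is requests.utils.select_proxy. It is a utility function rather than part of the main documented interface, so treat relying on it as an assumption. The sketch below merges the scheme-wide and host-specific entries shown above into one dictionary; nothing is sent over the network.

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://10.10.1.10:3128',
    'http://10.20.1.128': 'http://10.10.1.10:5323',
}

# The scheme://hostname key takes priority for that host.
host_specific = select_proxy('http://10.20.1.128/page', proxies)
# Other http URLs fall back to the scheme-wide entry.
scheme_wide = select_proxy('http://example.org/', proxies)
# https URLs have no matching entry in this dictionary.
no_match = select_proxy('https://example.org/', proxies)

print(host_specific)  # http://10.10.1.10:5323
print(scheme_wide)    # http://10.10.1.10:3128
print(no_match)       # None
```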

By now you should have a better understanding of how to use the Requests library for Python web scraping, so go ahead and try it in practice.

Page title: How to Use the Requests Library for Python Web Scraping
Page URL: http://weahome.cn/article/jdcoip.html
