The Scrapy Framework
Scrapy is an application framework written in Python for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
Scrapy handles network communication with Twisted, an efficient event-driven asynchronous networking framework. This speeds up downloading without you having to implement an asynchronous framework yourself, and Scrapy exposes a variety of middleware interfaces so it can be flexibly adapted to different needs.
Scrapy Architecture
Scrapy Engine
The engine controls the flow of data among all the components of the system and triggers events when particular actions occur. It is the "brain" of the crawler and the scheduling center of the whole framework.
Scheduler
The scheduler receives requests from the engine and enqueues them so it can hand them back when the engine later asks for them.
The initial URLs to crawl, as well as URLs extracted from pages along the way, are placed in the scheduler to await crawling. The scheduler also deduplicates URLs automatically (deduplication can be turned off for particular requests, e.g. POST request URLs).
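Deduplication can be bypassed for an individual request by passing dont_filter=True to scrapy.Request; a minimal sketch:

import scrapy

# By default the scheduler filters out requests whose URL it has seen before.
# dont_filter=True enqueues this request anyway (useful e.g. for repeated POST URLs).
req = scrapy.Request('https://book.douban.com/tag/%E7%BC%96%E7%A8%8B',
                     dont_filter=True)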
Downloader
The downloader fetches page data and hands it to the engine, which then passes it on to the spider.
Spiders
A Spider is a class written by the user; it has two jobs:
Parsing responses and extracting items (the scraped data).
Following up additional URLs: extra URLs to crawl are submitted to the engine and added to the Scheduler. Each spider is responsible for one specific site (or a few sites).
Item Pipeline
The Item Pipeline processes the items extracted by spiders. Typical processing includes cleaning, validation, and persistence (e.g. saving to a database).
Once the data a page was parsed for has been stored in an Item, the item is sent to the pipeline, passed through the configured sequence of pipeline components, and finally written to a local file or a database.
This is analogous to a shell pipeline such as $ ls | grep test, or to filters in Django templates.
Some typical applications of item pipelines (a combined sketch follows the list):
Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for (and dropping) duplicates
Saving the scraped results to a database
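For instance, validation and deduplication can be combined in one pipeline using scrapy's DropItem exception. A minimal sketch (the class name is illustrative; the fields match the BookItem defined later in this article):

from scrapy.exceptions import DropItem

class ValidateAndDedupPipeline(object):
    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        # validate: require a title field
        if not item.get('title'):
            raise DropItem('missing title in {}'.format(item))
        # deduplicate: drop items whose title was already seen
        if item['title'] in self.seen_titles:
            raise DropItem('duplicate item: {}'.format(item['title']))
        self.seen_titles.add(item['title'])
        return item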
Downloader Middlewares
Put simply, these are components for customizing and extending download behavior.
Downloader middlewares are specific hooks between the engine and the downloader; they process the requests and responses passing between them.
They provide a simple mechanism for extending Scrapy by plugging in custom code.
For example, downloader middlewares can make a crawler rotate its user-agent, IP address, and so on automatically.
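A minimal sketch of such a middleware, with an illustrative class name and agent list (it would be registered under DOWNLOADER_MIDDLEWARES in settings.py):

import random

class RandomUserAgentMiddleware(object):
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def process_request(self, request, spider):
        # called for every request passing through this middleware;
        # returning None lets processing continue with the modified request
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None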
Spider Middlewares
Spider middlewares are specific hooks between the engine and the spiders; they process a spider's input (responses) and output (items and requests).
They provide the same simple mechanism for extending Scrapy by plugging in custom code.
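A minimal sketch (illustrative class name; it would be registered under SPIDER_MIDDLEWARES): process_spider_output sees every item and request the spider yields, so it can log or filter them before they reach the engine:

class LoggingSpiderMiddleware(object):
    def process_spider_output(self, response, result, spider):
        # result is an iterable of the items and requests yielded by the spider
        for obj in result:
            spider.logger.debug('spider produced: %r', obj)
            yield obj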
Data Flow
1. The engine opens a website (opens a domain), finds the Spider that handles it, and asks that spider for the first URL(s) to crawl.
2. The engine gets the first URL(s) to crawl from the Spider and adds them to the Scheduler as requests awaiting scheduling.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to crawl, and the engine forwards it through the downloader middlewares to the Downloader.
5. Once the page is downloaded, the downloader generates a Response for it and sends it back through the downloader middlewares to the engine.
6. The engine receives the Response from the downloader and sends it through the spider middlewares to the Spider for processing.
7. The Spider processes the Response and returns the extracted Items and any (follow-up) new Requests to the engine.
8. The engine hands the Items returned by the Spider to the Item Pipeline, and the Requests to the scheduler.
9. The cycle repeats (from step 2) until no pending requests remain in the scheduler, at which point the engine shuts down.
Note:
The program only stops once the scheduler contains no more requests. URLs whose download failed are retried.
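Retrying failed downloads is handled by Scrapy's built-in RetryMiddleware, which is tuned through settings; a sketch of the relevant settings.py entries with their usual defaults:

# settings.py -- retry behaviour of the built-in RetryMiddleware
RETRY_ENABLED = True                           # retrying is on by default
RETRY_TIMES = 2                                # extra attempts per failed request
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]   # response codes treated as failures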
Installing Scrapy
Install wheel support:
  $ pip install wheel
Install the Scrapy framework:
  $ pip install scrapy
On Windows, to avoid compiling the Twisted dependency from source, install the prebuilt binary package:
  $ pip install Twisted-18.4.0-cp35-cp35m-win_amd64.whl
On Windows you may see the following error:

copying src\twisted\words\xish\xpathparser.g -> build\lib.win-amd64-3.5\twisted\words\xish
running build_ext
building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

The solution is to download a precompiled Twisted wheel from "Python Extension Packages for Windows":
Python 3.5: download Twisted-18.4.0-cp35-cp35m-win_amd64.whl
Python 3.6: download Twisted-18.4.0-cp36-cp36m-win_amd64.whl
Install Twisted:
  $ pip install Twisted-18.4.0-cp35-cp35m-win_amd64.whl
After that, installing scrapy proceeds without problems.
Once installed, try the scrapy command:
> scrapy
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
Scrapy Development
Project Workflow
1. Create the project
  Create a scrapy project with scrapy startproject proname:
  $ scrapy startproject proname
2. Write the item
  In items.py, write the Item class that specifies what to extract from responses.
3. Write the spider
  Write spiders/proname_spider.py, the spider that crawls the site and extracts the items.
4. Write the item pipeline
  Process the items, e.g. for storage.
1 Creating the Project
1.1 Crawling Douban Book Reviews
Links to the first and second pages for the tag "编程" (programming):
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
Pick any directory and create the project by running:
$ scrapy startproject first .
This produces the following directories and files:
first
├─ scrapy.cfg
└─ first
   ├─ items.py
   ├─ middlewares.py
   ├─ pipelines.py
   ├─ settings.py
   ├─ __init__.py
   └─ spiders
      └─ __init__.py
first:
  The outer first directory is the project directory as a whole; the inner first directory is the project's global package.
scrapy.cfg:
  The required project configuration file.
The first project directory:
  __init__.py — required; makes the directory a package
  items.py — defines the Item classes, which inherit from scrapy.Item and contain scrapy.Field instances
  pipelines.py — the key part is the process_item() method, which processes items
settings.py:
  BOT_NAME — the crawler name
  ROBOTSTXT_OBEY = True — whether to obey the robots.txt protocol
  USER_AGENT = '' — the User-Agent to use when crawling
  CONCURRENT_REQUESTS = 16 — 16 concurrent requests by default
  DOWNLOAD_DELAY = 3 — download delay; usually worth setting, since consecutive requests should not be fired too quickly
  COOKIES_ENABLED = False — enabled by default; cookies generally only need to be enabled when the site requires login
  SPIDER_MIDDLEWARES — spider middleware configuration
  DOWNLOADER_MIDDLEWARES — downloader middleware configuration; an entry like 'first.middlewares.FirstDownloaderMiddleware': 543 means the smaller the number, the higher the priority
  ITEM_PIPELINES — pipeline configuration; an entry like 'firstscrapy.pipelines.FirstscrapyPipeline': 300 says which pipeline handles items, and again the smaller the number, the higher the priority
The spiders directory:
  __init__.py — required; you can write spider classes here, or in spider submodules
# first/settings.py (reference)
BOT_NAME = 'first'
SPIDER_MODULES = ['first.spiders']
NEWSPIDER_MODULE = 'first.spiders'

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 3

# Disable cookies (enabled by default)
COOKIES_ENABLED = False
Note: be sure to change the User-Agent; otherwise requests to https://book.douban.com/ return 403.
2 Writing the Item
Write this in items.py:

import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()  # book title
    rate = scrapy.Field()   # rating
3 Writing the Spider
Write the spider class that crawls the Douban book reviews in the spiders directory.
The spider class must inherit from scrapy.Spider; in it you define the spider name, the crawl scope, the start URLs, and so on.
scrapy.Spider does not implement the parse method, so subclasses must implement it; the method is passed a response object.
# from the scrapy source
class Spider():
    def parse(self, response):  # parse the returned content
        raise NotImplementedError
Crawl the book channel for the tag "编程", extracting book titles and ratings:
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
Create the spider from a template: $ scrapy genspider -t basic book https://www.douban.com/
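The basic template generates a skeleton spider in spiders/book.py roughly like the following (the exact contents vary with the Scrapy version):

import scrapy

class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['www.douban.com']
    start_urls = ['https://www.douban.com/']

    def parse(self, response):
        pass

For this article the generated file is edited into the spider below.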
import scrapy

class BookSpider(scrapy.Spider):  # BookSpider
    name = 'doubanbook'  # spider name, may be changed; important
    allowed_domains = ['douban.com']  # the spider's crawl scope
    url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B'  # Douban book tag: programming
    start_urls = [url]  # start URLs

    # The downloader only has to fetch the web server's response;
    # parse parses the response content
    def parse(self, response):
        print(type(response), '~~~~~~~~~')  # scrapy.http.response.html.HtmlResponse
        print(response)
        print('-' * 30)
Crawl with the crawl subcommand:

$ scrapy list
$ scrapy crawl -h
scrapy crawl [options] <spider>

Start crawling by spider name:
$ scrapy crawl doubanbook

Optionally suppress the log output:
$ scrapy crawl doubanbook --nolog
If running on Windows raises the Twisted exception ModuleNotFoundError: No module named 'win32api', install pywin32 with $ pip install pywin32.
The response is the server's HTTP response; it is an instance of scrapy.http.response.html.HtmlResponse.
Accordingly, modify the code as follows:
import scrapy
from scrapy.http.response.html import HtmlResponse

class BookSpider(scrapy.Spider):  # BookSpider
    name = 'doubanbook'  # spider name
    allowed_domains = ['douban.com']  # the spider's crawl scope
    url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B'  # Douban book tag: programming
    start_urls = [url]  # start URLs

    # The downloader only has to fetch the web server's response;
    # parse parses the response content
    def parse(self, response: HtmlResponse):
        print(type(response))  # scrapy.http.response.html.HtmlResponse
        print('-' * 30)
        print(type(response.text), type(response.body))
        print('-' * 30)
        print(response.encoding)
        with open('o:/testbook.html', 'w', encoding='utf-8') as f:
            try:
                f.write(response.text)
                f.flush()
            except Exception as e:
                print(e)
3.1 Parsing the HTML
The response object the spider obtains can be parsed with a parsing library.
Scrapy wraps lxml, and the parent class TextResponse provides both an xpath method and a css method; the two interfaces can be mixed when parsing HTML.
Selector reference:
https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html#id3
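As a quick aside, the same selectors can be tried outside the crawl loop by building a scrapy Selector directly from an HTML string (a minimal sketch; the HTML fragment is made up to mimic one Douban list entry):

from scrapy.selector import Selector

html = ('<li class="subject-item"><h3><a href="/subject/1">Fluent Python</a></h3>'
        '<span class="rating_nums">9.4</span></li>')

sel = Selector(text=html)
for subject in sel.xpath('//li[@class="subject-item"]'):
    title = subject.xpath('.//h3/a/text()').extract_first()
    rate = subject.css('span.rating_nums::text').extract_first()
    print(title, rate)  # Fluent Python 9.4

The full example below instead builds an HtmlResponse from the page saved earlier: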
import scrapy
from scrapy.http.response.html import HtmlResponse

response = HtmlResponse('file:///O:/testbook.html', encoding='utf-8')  # construct the object

with open('o:/testbook.html', encoding='utf8') as f:
    response._set_body(f.read())  # fill in the body
    # print(response.text)

# Extract all titles and ratings
# xpath parsing
subjects = response.xpath('//li[@class="subject-item"]')
for subject in subjects:
    title = subject.xpath('.//h3/a/text()').extract()  # list
    print(title[0].strip())

    rate = subject.xpath('.//span[@class="rating_nums"]/text()').extract()
    print(rate[0].strip())

print('-' * 30)
# css parsing
subjects = response.css('li.subject-item')
for subject in subjects:
    title = subject.css('h3 a::text').extract()
    print(title[0].strip())

    rate = subject.css('span.rating_nums::text').extract()
    print(rate[0].strip())
print('-' * 30)

# mixing xpath and css, plus regular-expression matching
subjects = response.css('li.subject-item')
for subject in subjects:
    # extract the link
    href = subject.xpath('.//h3').css('a::attr(href)').extract()
    print(href[0])

    # use a regular expression
    id = subject.xpath('.//h3/a/@href').re(r'\d*99\d*')
    if id:
        print(id[0])

    # show only entries rated 9 or above
    rate = subject.xpath('.//span[@class="rating_nums"]/text()').re(r'^9.*')
    # rate = subject.css('span.rating_nums::text').re(r'^9\..*')
    if rate:
        print(rate)
3.2 Packaging Data in Items
# spiders/bookspider.py
import scrapy
from scrapy.http.response.html import HtmlResponse
from ..items import BookItem

class BookSpider(scrapy.Spider):  # BookSpider
    name = 'doubanbook'  # spider name
    allowed_domains = ['douban.com']  # the spider's crawl scope
    url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B'  # Douban book tag: programming
    start_urls = [url]  # start URLs

    # The downloader only has to fetch the web server's response;
    # parse parses the response content
    def parse(self, response: HtmlResponse):
        items = []
        # xpath parsing
        subjects = response.xpath('//li[@class="subject-item"]')
        for subject in subjects:
            title = subject.xpath('.//h3/a/text()').extract()
            rate = subject.xpath('.//span[@class="rating_nums"]/text()').extract_first()
            item = BookItem()
            item['title'] = title[0].strip()
            item['rate'] = rate.strip()
            items.append(item)

        print(items)

        return items  # must return, otherwise nothing gets saved

# Save the returned data from the command line:
# scrapy crawl -h
# --output=FILE, -o FILE  dump scraped items into FILE (use - for stdout)
# supported file extensions: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
# scrapy crawl doubanbook -o dbbooks.json
Running this produces the data shown in the (omitted) screenshot.
Note that the output is Unicode-escaped: the Chinese characters appear as their Unicode representations.
4 Pipeline Processing
Turn BookSpider in bookspider.py into a generator: simply change return items to yield item, so instead of producing one list it yields the items one at a time. The difference in miniature is sketched below.
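A miniature illustration in plain Python, independent of scrapy (the data is made up):

def parse_list():
    items = []
    for i in range(3):
        items.append({'title': 'book-%d' % i})
    return items  # the whole list is built in memory, then returned once

def parse_gen():
    for i in range(3):
        yield {'title': 'book-%d' % i}  # each item is handed over as soon as it is produced

print(parse_list())       # [{'title': 'book-0'}, ...]
print(list(parse_gen()))  # same contents, produced lazily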
The scaffolding has already created a pipelines.py file containing a class for us.
4.1 Enabling the Pipeline
# Configure item pipelines
# See the Item Pipeline chapter of the Scrapy 1.8.0 documentation
ITEM_PIPELINES = {
    'first.pipelines.FirstPipeline': 300,
}
The integer 300 is the priority: the smaller the value, the higher the priority.
Valid values range from 0 to 1000.
4.2 Common Methods
class FirstPipeline(object):
    def __init__(self):  # global setup
        print('~~~~~~~~~~ init ~~~~~~~~~~~~')

    def open_spider(self, spider):  # called when a spider opens
        print(spider, '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')

    def process_item(self, item, spider):
        # item: the scraped item; spider: the spider that scraped it
        return item

    def close_spider(self, spider):  # called when a spider closes
        print(spider, '========================================')
Requirement:
Use a pipeline to store the scraped data in a JSON file.
# spider/bookspider.py
import scrapy
from scrapy.http.response.html import HtmlResponse
from ..items import BookItem

class BookSpider(scrapy.Spider):  # BookSpider
    name = 'doubanbook'  # spider name
    allowed_domains = ['douban.com']  # the spider's crawl scope
    url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B'  # Douban book tag: programming
    start_urls = [url]  # start URLs

    # custom per-spider settings
    custom_settings = {
        'filename': 'o:/books.json'
    }

    # The downloader only has to fetch the web server's response;
    # parse parses the response content
    def parse(self, response: HtmlResponse):
        # items = []
        # xpath parsing
        subjects = response.xpath('//li[@class="subject-item"]')
        for subject in subjects:
            title = subject.xpath('.//h3/a/text()').extract()
            rate = subject.xpath('.//span[@class="rating_nums"]/text()').extract_first()
            item = BookItem()
            item['title'] = title[0].strip()
            item['rate'] = rate.strip()
            # items.append(item)

            yield item
        # return items

# pipelines.py
import simplejson as json

class FirstPipeline(object):
    def __init__(self):  # global setup
        print('~~~~~~~~~~ init ~~~~~~~~~~~~')

    def open_spider(self, spider):  # called when a spider opens
        print('{} ~~~~~~~~~~~~~~~~~~~~'.format(spider))
        print(spider.settings.get('filename'))
        self.file = open(spider.settings['filename'], 'w', encoding='utf-8')
        self.file.write('[\n')

    def process_item(self, item, spider):
        # item: the scraped item; spider: the spider that scraped it
        self.file.write(json.dumps(dict(item)) + ',\n')
        return item

    def close_spider(self, spider):  # called when a spider closes
        self.file.write(']')
        self.file.close()
        print('{} ======================='.format(spider))
        print('-' * 30)
5 URL Extraction
To crawl the following pages, you can either analyze how the page number changes from page to page yourself, or extract the links from the pagination bar.
# spider/bookspider.py
import scrapy
from scrapy.http.response.html import HtmlResponse
from ..items import BookItem

class BookSpider(scrapy.Spider):  # BookSpider
    name = 'doubanbook'  # spider name
    allowed_domains = ['douban.com']  # the spider's crawl scope
    url = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B'  # Douban book tag: programming
    start_urls = [url]  # start URLs

    # custom per-spider settings
    custom_settings = {
        'filename': 'o:/books.json'
    }

    # The downloader only has to fetch the web server's response;
    # parse parses the response content
    def parse(self, response: HtmlResponse):
        # items = []
        # xpath parsing
        # get the next page; this is only a test, so use re to limit the page numbers
        print('-' * 30)
        urls = response.xpath('//div[@class="paginator"]/span[@class="next"]/a/@href').re(
            r'.*start=[24]\d[^\d].*')
        print(urls)
        print('-' * 30)
        yield from (scrapy.Request(response.urljoin(url)) for url in urls)
        print('++++++++++++++++++++++++')

        subjects = response.xpath('//li[@class="subject-item"]')
        for subject in subjects:
            # join the book title with its subtitle
            title = "".join(map(lambda x: x.strip(), subject.xpath('.//h3/a//text()').extract()))
            rate = subject.xpath('.//span[@class="rating_nums"]/text()').extract_first()
            # print(rate)  # some books have no rating, so this may return None

            item = BookItem()
            item['title'] = title
            item['rate'] = rate
            # items.append(item)
            yield item

        # return items