How to Use pyspider in Python

This article explains how to use the Python crawler framework pyspider: what it is, how it compares with Scrapy, how to install it, and how to build a small crawler that stores its results in MySQL.


1 Introduction

pyspider is a crawler framework with a WebUI that supports task monitoring, project management, and multiple databases. It is written in Python and has a distributed architecture. Its features in detail:

A web-based script editor, task monitor, project manager, and result viewer;
Database support for MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL, plus SQLAlchemy;
Message queue support for RabbitMQ, Beanstalk, Redis, and Kombu;
Support for crawling JavaScript-rendered pages;
Replaceable components, with single-machine, distributed, and Docker deployment;
Powerful scheduling control, with re-crawling on timeout and priority settings;
Support for Python 2 and 3.

pyspider consists of three main parts: the Scheduler, the Fetcher, and the Processor. The whole crawl is watched by the Monitor, and the scraped results are handled by the Result Worker. The basic flow is: the Scheduler dispatches tasks, the Fetcher fetches page content, and the Processor parses it; newly generated requests are sent back to the Scheduler for scheduling, and the extracted results are output and saved.
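
The Scheduler, Fetcher, Processor, and Result Worker loop described above can be sketched in a few lines of plain Python. This is only an illustration, not pyspider's implementation: the FAKE_WEB dictionary, the function names, and the in-memory queue are all assumptions made for the example, and no network access takes place.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (links found on the page, extracted data).
FAKE_WEB = {
    "start": (["page1", "page2"], None),
    "page1": ([], {"title": "first"}),
    "page2": (["page1"], {"title": "second"}),
}


def fetcher(url):
    """Fetcher: return the page content for a URL (simulated here)."""
    return FAKE_WEB[url]


def processor(page):
    """Processor: split a page into new requests and an extracted result."""
    links, data = page
    return links, data


def run(start_url):
    """Scheduler loop: dispatch tasks and skip URLs that were already queued."""
    queue = deque([start_url])
    seen = {start_url}
    results = []  # stands in for the Result Worker's storage
    while queue:
        url = queue.popleft()
        new_urls, data = processor(fetcher(url))
        for new_url in new_urls:
            if new_url not in seen:  # new requests go back to the Scheduler
                seen.add(new_url)
                queue.append(new_url)
        if data is not None:
            results.append(data)
    return results


results = run("start")
print(results)
```

In pyspider itself these parts run as separate components connected by message queues, which is what allows them to be deployed on different machines.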

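2 pyspider vs Scrapy

pyspider has a WebUI in which crawlers can be written and debugged; Scrapy is driven by code and the command line, and needs Portia for a visual interface.
pyspider supports crawling JavaScript-rendered pages through PhantomJS; Scrapy needs the Scrapy-Splash component.
pyspider has PyQuery built in as its selector (see "Python Crawler (5): The PyQuery Framework"); Scrapy works with XPath, CSS selectors, and regular-expression matching.
pyspider is less extensible; Scrapy's modules are loosely coupled and highly extensible, for example through Middleware and Pipeline components that add more powerful features.

Overall, pyspider is more convenient and Scrapy is more extensible. For a quick crawl, prefer pyspider; for large-scale crawls or sites with strong anti-crawling measures, prefer Scrapy.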

3 Installation

Method 1

pip install pyspider

This is the simpler way, but on Windows it may fail with the error: Command "python setup.py egg_info" failed with error ... I ran into this problem on my own Windows machine, so I installed with the second method below.

Method 2

Install using a wheel. The steps are:

Run pip install wheel to install wheel;
Open https://www.lfd.uci.edu/~gohlke/pythonlibs/, press Ctrl + F to search for pycurl, and download the build that matches your Python version. For example, I use Python 3.6, so I chose the build tagged cp36, as highlighted by the red box in the figure below:

[Figure: pycurl download list with the cp36 build highlighted]

Install the downloaded file with pip, for example: pip install E:\pycurl-7.43.0.3-cp36-cp36m-win_amd64.whl;
Finally, run pip install pyspider again.
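
If you are unsure which tag matches your interpreter, a small script can work it out. The sketch below assumes the standard wheel file name layout (name-version-pythontag-abitag-platform.whl); the helper names cpython_tag and wheel_matches are made up for this example and are not part of pip or wheel.

```python
import sys


def cpython_tag(version_info=sys.version_info):
    """Return the CPython wheel tag, e.g. 'cp36' for Python 3.6."""
    return "cp{}{}".format(version_info[0], version_info[1])


def wheel_matches(filename, tag):
    """Check whether a wheel file name carries the given Python tag."""
    if not filename.endswith(".whl"):
        return False
    # Wheel names look like name-version-pythontag-abitag-platform.whl
    parts = filename[:-len(".whl")].split("-")
    return len(parts) >= 5 and parts[-3] == tag


if __name__ == "__main__":
    tag = cpython_tag()
    print("Tag for this interpreter:", tag)
    print(wheel_matches("pycurl-7.43.0.3-cp36-cp36m-win_amd64.whl", tag))
```

The platform part of the file name (win_amd64 in the example) also has to match your system; this sketch checks only the Python tag.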

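After the steps above, type pyspider in the console, as shown in the figure:

[Figure: console output after running pyspider]

Output like the above means pyspider started successfully. If startup hangs at result_worker starting..., open another console window, run pyspider there as well, and close the first window once the second one has started.

To verify, open a browser and go to http://localhost:5000, as shown in the figure:

[Figure: pyspider WebUI at http://localhost:5000]

The WebUI loads, which confirms that pyspider started successfully.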
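
The same check can be scripted instead of opening a browser. The following is a minimal sketch using only the standard library; the function name webui_is_up and the 3-second timeout are choices made for this example, and the script only tests that something answers with HTTP 200 at the address, not that it is pyspider.

```python
import urllib.error
import urllib.request


def webui_is_up(url="http://localhost:5000", timeout=3):
    """Return True if an HTTP server answers at url with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        return False


if __name__ == "__main__":
    print("WebUI reachable:", webui_is_up())
```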

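4 Quick Start

4.1 Creating a Project

First, click the Create button in the web interface to start creating a project, as highlighted by the red box in the figure:

[Figure: web interface with the Create button highlighted]

A dialog for entering the project information then appears, as shown in the figure:

[Figure: project information dialog]

Project Name: the name of the project
Start URL(s): the URL(s) to start crawling from

Fill in Project Name and Start URL(s). Here we use second-hand housing listings on Lianjia as the example: https://hz.lianjia.com/ershoufang. Then click the Create button. The result is shown in the figure:

[Figure: result after clicking Create]

4.2 Implementing the Crawler

When pyspider accesses a site over https it reports a certificate problem (usually HTTP 599), so we need to add the parameter validate_cert=False to the crawl method to skip certificate verification, as shown in the figure:

[Figure: crawl call with validate_cert=False added]

We plan to collect each listing's unit price (unit_price), title (title), and selling points (sell_point). The implementation is as follows: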

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://hz.lianjia.com/ershoufang/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('.title').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        yield {
            'unit_price': response.doc('.unitPrice').text(),
            'title': response.doc('.main').text(),
            'sell_point': response.doc('.baseattribute > .content').text()
        }

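@every(minutes=24 * 60): tells the Scheduler to run on_start once a day.
@config(age=10 * 24 * 60 * 60): sets how long a task stays valid (10 days here); a page crawled within that period is treated as unchanged and is not crawled again.
@config(priority=2): sets the task priority; tasks with a larger number are scheduled first.
on_start(self): the entry point of the script.
self.crawl(url, callback): the main method; it creates a crawl task.
index_page(self, response): parses the returned HTML document; here it follows the link of every element matching .title.
detail_page(self, response): returns a dict as the result.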
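
To make the meaning of age and priority concrete, here is a toy scheduler in plain Python. It is a simplified sketch and not pyspider's scheduler: the class TinyScheduler, its methods, and the explicit now argument (a timestamp in seconds) are assumptions made for this example.

```python
import heapq
import itertools


class TinyScheduler:
    """Toy scheduler illustrating age (validity period) and priority."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: first in, first out
        self._last_crawled = {}  # url -> time of the last crawl

    def submit(self, url, now, age=0, priority=0):
        """Queue url unless it was crawled less than age seconds ago."""
        last = self._last_crawled.get(url)
        if last is not None and now - last < age:
            return False  # still considered fresh, so skip it
        # heapq is a min-heap, so negate priority to pop the largest first
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def pop(self, now):
        """Return the next url to crawl and record the crawl time."""
        _, _, url = heapq.heappop(self._heap)
        self._last_crawled[url] = now
        return url


DAY = 24 * 60 * 60
scheduler = TinyScheduler()
scheduler.submit("index", now=0, age=10 * DAY)
scheduler.submit("detail", now=0, priority=2)
first = scheduler.pop(now=0)   # the priority-2 task comes out first
second = scheduler.pop(now=0)
resubmitted_early = scheduler.submit("index", now=5 * DAY, age=10 * DAY)
resubmitted_late = scheduler.submit("index", now=11 * DAY, age=10 * DAY)
print(first, second, resubmitted_early, resubmitted_late)
```

With a 10-day age, the index page submitted again after 5 days is skipped, while the submission after 11 days is accepted.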

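Click the run button, as shown in the figure:

[Figure: the run button]

After clicking it, a notification appears on the follows button, as shown in the figure:

[Figure: notification on the follows button]

Click the follows button. The result is shown in the figure:

[Figure: follows list with the triangle button highlighted]

Click the triangle button highlighted by the red box in the figure above. The result is shown in the figure:

[Figure: resulting list of detail_page tasks]

Pick any detail_page entry and click the triangle button on its right. The result is shown in the figure:

[Figure: extracted result for one detail_page task]

The result shows that we can already scrape the information we need.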

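4.3 Storing the Data

Once the information has been scraped it needs to be stored. We plan to store it in a MySQL database.

First, install pymysql with the following command:

pip install pymysql

Then add the code that saves the data. The complete code is as follows: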

from pyspider.libs.base_handler import *
import pymysql


class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        # Replace the placeholder values below with your own MySQL settings
        self.db = pymysql.connect(host='your_host', user='your_username',
                                  password='your_password', database='your_db',
                                  charset='utf8')

    def add_Mysql(self, title, unit_price, sell_point):
        try:
            cursor = self.db.cursor()
            # Pass the values as parameters so that pymysql escapes them
            sql = 'insert into house(title, unit_price, sell_point) values (%s, %s, %s)'
            print(sql)
            cursor.execute(sql, (title, unit_price, sell_point))
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://hz.lianjia.com/ershoufang/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('.title').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('.main').text()
        unit_price = response.doc('.unitPrice').text()
        sell_point = response.doc('.baseattribute > .content').text()
        self.add_Mysql(title, unit_price, sell_point)
        yield {
            'title': title,
            'unit_price': unit_price,
            'sell_point': sell_point
        }
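
When writing scraped text into a database, it is safer to bind the values as query parameters than to format them into the SQL string, because scraped titles can contain quotes. The sketch below shows the idea with the standard-library sqlite3 module standing in for MySQL so that it runs without a server. The table layout and the sample row are invented for the example (the article does not show how the house table was created), and sqlite3 uses ? as its placeholder where pymysql uses %s.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption for this demo).
conn = sqlite3.connect(":memory:")
# Hypothetical schema: the real house table definition is not given in the article.
conn.execute("create table house (title text, unit_price text, sell_point text)")


def add_house(conn, title, unit_price, sell_point):
    """Insert one row, letting the driver bind the values safely."""
    conn.execute(
        "insert into house(title, unit_price, sell_point) values (?, ?, ?)",
        (title, unit_price, sell_point),
    )
    conn.commit()


# Made-up sample row; the double quotes in the title would break a statement
# assembled with string formatting and "%s" wrapped in double quotes.
add_house(conn, 'Bright 3-room flat "near metro"', "35000", "South-facing")
rows = conn.execute("select title, unit_price, sell_point from house").fetchall()
print(rows)
```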

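First, test whether the data can be saved to MySQL. Again pick a detail_page entry, as shown in the figure:

[Figure: a detail_page entry selected]

Click the triangle button on its right. The result is shown in the figure:

[Figure: output after running the detail_page task]

The output shows that the save operation ran. Now check MySQL, as shown in the figure:

[Figure: saved rows in MySQL]

The data has been stored in MySQL.

Above we saved the data by running a task manually. Next, let's see how to save it by running the project as a task.

Click the pyspider button in the upper-left corner of the current page, as shown in the figure:

[Figure: pyspider button in the upper-left corner]

This returns to the dashboard, as shown in the figure:

[Figure: dashboard with the status field highlighted]

Click the position highlighted by the red box under status, change the status to RUNNING or DEBUG, and then click the run button under actions.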

