這篇文章給大家介紹怎么在scrapy中利用phantomJS實(shí)現(xiàn)異步爬取,內(nèi)容非常詳細(xì),感興趣的小伙伴們可以參考借鑒,希望對(duì)大家能有所幫助。
專注于為中小企業(yè)提供成都網(wǎng)站制作、成都做網(wǎng)站、外貿(mào)營(yíng)銷網(wǎng)站建設(shè)服務(wù),電腦端+手機(jī)端+微信端的三站合一,更高效的管理,為中小企業(yè)朝陽(yáng)縣免費(fèi)做網(wǎng)站提供優(yōu)質(zhì)的服務(wù)。我們立足成都,凝聚了一批互聯(lián)網(wǎng)行業(yè)人才,有力地推動(dòng)了上1000家企業(yè)的穩(wěn)健成長(zhǎng),幫助中小企業(yè)通過(guò)網(wǎng)站建設(shè)實(shí)現(xiàn)規(guī)模擴(kuò)充和轉(zhuǎn)變。使用時(shí)需要PhantomJSDownloadHandler添加到配置文件的DOWNLOADER中。
# encoding: utf-8 from __future__ import unicode_literals from scrapy import signals from scrapy.signalmanager import SignalManager from scrapy.responsetypes import responsetypes from scrapy.xlib.pydispatch import dispatcher from selenium import webdriver from six.moves import queue from twisted.internet import defer, threads from twisted.python.failure import Failure class PhantomJSDownloadHandler(object): def __init__(self, settings): self.options = settings.get('PHANTOMJS_OPTIONS', {}) max_run = settings.get('PHANTOMJS_MAXRUN', 10) self.sem = defer.DeferredSemaphore(max_run) self.queue = queue.LifoQueue(max_run) SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed) def download_request(self, request, spider): """use semaphore to guard a phantomjs pool""" return self.sem.run(self._wait_request, request, spider) def _wait_request(self, request, spider): try: driver = self.queue.get_nowait() except queue.Empty: driver = webdriver.PhantomJS(**self.options) driver.get(request.url) # ghostdriver won't response when switch window until page is loaded dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle)) dfd.addCallback(self._response, driver, spider) return dfd def _response(self, _, driver, spider): body = driver.execute_script("return document.documentElement.innerHTML") if body.startswith(""): # cannot access response header in Selenium body = driver.execute_script("return document.documentElement.textContent") url = driver.current_url respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8')) resp = respcls(url=url, body=body, encoding="utf-8") response_failed = getattr(spider, "response_failed", None) if response_failed and callable(response_failed) and response_failed(resp, driver): driver.close() return defer.fail(Failure()) else: self.queue.put(driver) return defer.succeed(resp) def _close(self): while not self.queue.empty(): driver = self.queue.get_nowait() driver.close()
關(guān)于怎么在scrapy中利用phantomJS實(shí)現(xiàn)異步爬取就分享到這里了,希望以上內(nèi)容可以對(duì)大家有一定的幫助,可以學(xué)到更多知識(shí)。如果覺(jué)得文章不錯(cuò),可以把它分享出去讓更多的人看到。