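As a developer based in the Beijing-Tianjin-Hebei region, I like to look at data about Shijiazhuang, that "international metropolis", whenever I have some spare time. This post scrapes rental listings from Lianjia; the collected data can serve as material for data analysis in later posts.
The URL we need to scrape is: https://sjz.lianjia.com/zufang/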
First, decide which data we actually need.
As you can see, the boxed areas on the listing page are the data we want.
Next, work out the pagination pattern:
https://sjz.lianjia.com/zufang/pg1/
https://sjz.lianjia.com/zufang/pg2/
https://sjz.lianjia.com/zufang/pg3/
https://sjz.lianjia.com/zufang/pg4/
https://sjz.lianjia.com/zufang/pg5/
...
https://sjz.lianjia.com/zufang/pg80/
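The pagination pattern above can be expressed as a small helper. This is only a sketch: the names `BASE_URL` and `build_page_urls` are made up for illustration, and the default of 80 pages simply mirrors the list above; the real page count may change over time.

```python
# Hypothetical helper: builds the listing-page URLs from the pattern shown above.
BASE_URL = "https://sjz.lianjia.com/zufang/pg{}/"


def build_page_urls(first=1, last=80):
    """Return the listing-page URLs for pages first..last, inclusive."""
    return [BASE_URL.format(page) for page in range(first, last + 1)]


urls = build_page_urls()
print(urls[0])    # first page URL
print(len(urls))  # number of pages generated
```

Keeping the page range as a parameter makes it easy to test the spider on two or three pages before running it over the whole site.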
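With the pagination pattern known, the page links can be assembled quickly. We use the lxml module to parse the page source and extract the data we want.
This time the code uses a new module, fake_useragent, which returns a random UA (User-Agent) string. The module is simple to use, and a quick search turns up plenty of tutorials.
In this post we only use it to pick a random UA: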
self._ua = UserAgent()
self._headers = {"User-Agent": self._ua.random}  # pick a random UA
Because the page URLs can be generated up front, we fetch them with coroutines, and we use the pandas module to write the results to a CSV file.
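If fake_useragent is unavailable (it depends on a bundled or downloaded browser data set), a rough stand-in can be built with the standard library. The sketch below is not how fake_useragent works internally; the UA strings are illustrative values and the names `FALLBACK_USER_AGENTS` and `random_headers` are hypothetical.

```python
import random

# Illustrative User-Agent strings (assumed values, not taken from fake_useragent).
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(FALLBACK_USER_AGENTS)}


print(random_headers())
```

Calling `random_headers()` once per request, rather than once per spider instance, would also vary the UA between pages.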
from fake_useragent import UserAgent
from lxml import etree
import asyncio
import aiohttp
import pandas as pd


class LianjiaSpider(object):

    def __init__(self):
        self._ua = UserAgent()
        self._headers = {"User-Agent": self._ua.random}
        self._data = list()

    async def get(self, url):
        # Fetch one page; returns the HTML text, or None on an error or a non-200 status
        timeout = aiohttp.ClientTimeout(total=3)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            try:
                async with session.get(url, headers=self._headers) as resp:
                    if resp.status == 200:
                        return await resp.text()
            except Exception as e:
                print(e.args)
        return None

    async def parse_html(self):
        for page in range(1, 77):
            url = "https://sjz.lianjia.com/zufang/pg{}/".format(page)
            print("Crawling {}".format(url))
            html = await self.get(url)  # fetch the page content
            if html is None:  # skip pages that failed to download
                continue
            html = etree.HTML(html)  # parse the page
            self.parse_page(html)  # extract the data we want

        print("Saving data....")
        # Write everything collected so far to a CSV file
        data = pd.DataFrame(self._data)
        data.to_csv("链家网租房数据.csv", encoding='utf_8_sig')

    def run(self):
        asyncio.run(self.parse_html())


if __name__ == '__main__':
    l = LianjiaSpider()
    l.run()
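Note that the spider above awaits each page in turn inside a single coroutine, so the requests still run one after another. A minimal sketch of bounded concurrent fetching is shown below, assuming a hypothetical `crawl` helper and a stub `fake_fetch` that stands in for the aiohttp request so the example runs offline; in the real spider, `fetch` would be something like `self.get`.

```python
import asyncio


async def crawl(urls, fetch, limit=5):
    """Fetch all urls concurrently, at most `limit` at a time.

    Results are returned in the same order as `urls`; a failed fetch yields None.
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded_fetch(url):
        async with semaphore:  # cap the number of in-flight requests
            try:
                return await fetch(url)
            except Exception as exc:
                print("failed:", url, exc)
                return None

    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


async def fake_fetch(url):
    # Stand-in for a network request (assumption: replace with a real fetch coroutine).
    await asyncio.sleep(0.01)
    return "<html>{}</html>".format(url)


page_urls = ["https://sjz.lianjia.com/zufang/pg{}/".format(p) for p in range(1, 4)]
pages = asyncio.run(crawl(page_urls, fake_fetch, limit=2))
print(len(pages))
```

A small `limit` keeps the load on the target site modest; raising it speeds up the crawl but makes it more likely that requests are throttled or blocked.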
The spider class above still lacks the method that parses a page, so let's fill it in:
    # Add these methods to the LianjiaSpider class

    def remove_space(self, texts):
        # Assumed helper (not shown in the original post): join the text nodes
        # returned by xpath and drop all whitespace
        return "".join("".join(texts).split())

    def parse_page(self, html):
        info_panel = html.xpath("//div[@class='info-panel']")
        for info in info_panel:
            region = self.remove_space(info.xpath(".//span[@class='region']/text()"))
            zone = self.remove_space(info.xpath(".//span[@class='zone']/span/text()"))
            meters = self.remove_space(info.xpath(".//span[@class='meters']/text()"))
            where = self.remove_space(info.xpath(".//div[@class='where']/span[4]/text()"))

            con = info.xpath(".//div[@class='con']/text()")
            floor = con[0]  # floor
            house_type = con[1]  # style

            agent = info.xpath(".//div[@class='con']/a/text()")[0]

            has = info.xpath(".//div[@class='left agency']//text()")

            price = info.xpath(".//div[@class='price']/span/text()")[0]
            price_pre = info.xpath(".//div[@class='price-pre']/text()")[0]
            look_num = info.xpath(".//div[@class='square']//span[@class='num']/text()")[0]

            one_data = {
                "region": region,
                "zone": zone,
                "meters": meters,
                "where": where,
                "louceng": floor,
                "type": house_type,
                "xiaoshou": agent,
                "has": has,
                "price": price,
                "price_pre": price_pre,
                "num": look_num
            }
            self._data.append(one_data)  # add the record
Before long, most of the data has been scraped.