This article walks through how to write the code for a Python web crawler, using JD product comments as the example. The content is detailed and easy to follow, and the steps are simple and quick, so it should serve as a useful reference. Let's take a look.
創(chuàng)新互聯(lián)公司專業(yè)為企業(yè)提供荔浦網(wǎng)站建設(shè)、荔浦做網(wǎng)站、荔浦網(wǎng)站設(shè)計(jì)、荔浦網(wǎng)站制作等企業(yè)網(wǎng)站建設(shè)、網(wǎng)頁(yè)設(shè)計(jì)與制作、荔浦企業(yè)網(wǎng)站模板建站服務(wù),十余年荔浦做網(wǎng)站經(jīng)驗(yàn),不只是建網(wǎng)站,更提供有價(jià)值的思路和整體網(wǎng)絡(luò)服務(wù)。
import requests
import json
import os
import time
import random
import jieba
from wordcloud import WordCloud
from imageio import imread

comments_file_path = 'jd_comments.txt'

def get_jd_comments(page=0):
    # Fetch one page of JD product comments.
    url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=1340204&score=0&sortType=5&page=%s&pageSize=10&isShadowSku=0&fold=1' % page
    headers = {
        # 'referer' tells the site which page the request came from; it differs per site.
        'referer': 'https://item.jd.com/1340204.html',
        # 'user-agent' tells the site which browser is making the request.
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        # 'cookie' identifies what kind of user is viewing the data, a guest or a member;
        # a logged-in cookie is recommended.
        'cookie': '__jdu=1766075400; areaId=27; PCSYCityID=CN_610000_610100_610113; shshshfpa=a9dc241f-78b8-f3e1-edab-09485009987f-1585747224; shshshfpb=dwWV9IhxtSce3DU0STB1%20TQ%3D%3D; jwotest_product=99; unpl=V2_ZzNtbRAAFhJ3DUJTfhFcUGIAE1RKU0ZCdQoWU3kQXgcwBxJdclRCFnQUR1FnGF8UZAMZWEpcRhFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsfWwJmBRZYQ1ZzJXI4dmR9EFoAYjMTbUNnAUEpDURSeRhbSGcFFVpDUUcQdAl2VUsa; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_cfd63456491d4208954f13a63833f511|1585835385193; __jda=122270672.1766075400.1585747219.1585829967.1585835353.3; __jdc=122270672; 3AB9D23F7A4B3C9B=AXAFRBHRKYDEJAQ4SPJBVU4J4TI6OQHDFRDGI7ISQFUQGA6OZOQN52T3QYSRWPSIHTFRYRN2QEG7AMEV2JG6NT2DFM; shshshfp=03ed62977bfa44b85be24ef65fbd9b87; ipLoc-djd=27-2376-4343-53952; JSESSIONID=51895EFB4EBD95BA3B3ADAC8C6C73CD8.s1; shshshsID=d2435956e0c158fa7db1980c3053033d_15_1585836826172; __jdb=122270672.16.1766075400|3.1585835353'
    }
    try:
        response = requests.get(url, headers=headers)
    except requests.RequestException:
        print('something wrong!')
        return
    # Strip the JSONP wrapper "fetchJSON_comment98(...);" to get the raw JSON payload.
    comments_json = response.text[20:-2]
    # Parse the JSON string into a Python object.
    comments_json_obj = json.loads(comments_json)
    # Extract the full list of comments.
    comments_all = comments_json_obj['comments']
    with open(comments_file_path, 'a+', encoding='utf-8') as fout:
        for comment in comments_all:
            fout.write(comment['content'] + '\n')
            print(comment['content'])
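The slice `response.text[20:-2]` assumes the callback name is exactly `fetchJSON_comment98`; if JD ever changes the `callback` parameter, the offsets break silently. A more robust sketch (the helper name `unwrap_jsonp` is ours, not part of the original code) strips any `name(...)` wrapper with a regex before parsing:

```python
import json
import re

def unwrap_jsonp(text):
    # Match "callbackName( ... );" and capture the JSON payload inside the parentheses.
    match = re.fullmatch(r'\s*[\w$]+\((.*)\)\s*;?\s*', text, re.DOTALL)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))

# Works regardless of the callback name's length:
obj = unwrap_jsonp('fetchJSON_comment98({"comments": [{"content": "good"}]});')
```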
def batch_jd_comments():
    # Clear the output file before each run.
    if os.path.exists(comments_file_path):
        os.remove(comments_file_path)
    # Passing the page index i fetches that specific page of comments.
    for i in range(30):
        print('Crawling page ' + str(i + 1) + '...')
        get_jd_comments(i)
        # Sleep a random interval to mimic a human user and avoid getting
        # the IP banned for crawling too frequently.
        time.sleep(random.random() * 5)
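The random delay above throttles every request, even successful ones. A common complementary pattern (a sketch of our own, not part of the original code) is retry with exponential backoff on failure; making the `sleep` function injectable lets the logic be tested without actually waiting:

```python
import random

def with_backoff(fetch, retries=3, base_delay=1.0, sleep=None):
    # Call fetch(); on exception wait base_delay * 2**attempt plus jitter, then retry.
    if sleep is None:
        import time
        sleep = time.sleep
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.random())

# Demonstration with a fetcher that fails twice, then succeeds:
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError('transient error')
    return 'ok'

delays = []  # capture the computed delays instead of sleeping
result = with_backoff(flaky, retries=5, sleep=delays.append)
```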
# Tokenize the collected comments with jieba.
def cut_comments():
    with open(comments_file_path, encoding='utf-8') as file:
        comment_text = file.read()
    wordlist = jieba.lcut_for_search(comment_text)
    new_wordlist = ' '.join(wordlist)
    return new_wordlist
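Before rendering the cloud, it can help to inspect which tokens dominate. A minimal sketch using only the standard library, operating on a space-joined token string like the one `cut_comments` returns (the sample tokens here are made up for illustration):

```python
from collections import Counter

def top_tokens(space_joined, n=3):
    # Split on whitespace, drop single-character tokens (mostly particles), count the rest.
    tokens = [t for t in space_joined.split() if len(t) > 1]
    return Counter(tokens).most_common(n)

sample = '物流 很快 质量 不错 质量 好 物流 快 质量'
top = top_tokens(sample)
```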
# Use the image byt.jpg as a mask so the word cloud takes its shape.
def create_word_cloud():
    # Note: newer imageio versions expose this as imageio.v2.imread.
    mask = imread('byt.jpg')
    wordcloud = WordCloud(font_path='msyh.ttc', mask=mask).generate(cut_comments())
    wordcloud.to_file('picture.png')

if __name__ == '__main__':
    batch_jd_comments()  # collect the comments first
    create_word_cloud()
That wraps up "how to write the code for a Python web crawler". Thanks for reading! You should now have a working understanding of the topic; if you want to go further, keep exploring and experimenting.