I. Basics of the requests library
Methods provided by requests
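The requests library exposes one helper per HTTP verb (requests.get, requests.post, requests.put, requests.delete, requests.head, requests.options, requests.patch), all built on the general requests.request(method, url, ...). The sketch below only prepares requests without sending them, so the effect of the common params and data arguments can be inspected offline. The URL and the build helper are hypothetical and exist only for illustration.

```python
import requests

# Hypothetical endpoint used only for illustration; nothing is sent to it.
URL = "https://example.com/api/items"


def build(method, **kwargs):
    """Prepare (but do not send) a request so its final form can be inspected."""
    return requests.Request(method, URL, **kwargs).prepare()


# params is encoded into the query string of the URL.
get_req = build("GET", params={"page": 2})

# data is form-encoded into the request body.
post_req = build("POST", data={"name": "demo"})

# HEAD asks for the headers only; it carries no body.
head_req = build("HEAD")

print(get_req.method, get_req.url)
print(post_req.method, post_req.body, post_req.headers["Content-Type"])
print(head_req.method, head_req.url)
```

To actually send a request, the helper functions take the same arguments, for example requests.get(url, params=..., headers=..., timeout=...).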
The Response object in requests
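Every call such as requests.get returns a Response object. The attributes used most often are status_code, headers, encoding, text (the decoded body) and content (the raw bytes), plus raise_for_status() to turn HTTP errors into exceptions. The following is a minimal sketch that serves one page from a throwaway local server so it runs without internet access. DemoHandler and fetch_demo_response are hypothetical names introduced for this example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that always returns a tiny UTF-8 HTML page."""

    def do_GET(self):
        body = "<html><body><h4>hello</h4></body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch_demo_response():
    """Serve one page on a free local port and return the Response for it."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        session = requests.Session()
        session.trust_env = False  # ignore proxy settings for the local demo
        url = "http://127.0.0.1:%d/" % server.server_port
        return session.get(url, timeout=5)
    finally:
        server.shutdown()
        server.server_close()


r = fetch_demo_response()
print(r.status_code)              # HTTP status code as an int
print(r.headers["Content-Type"])  # response headers, case-insensitive dict
print(r.encoding)                 # encoding taken from the headers
print(r.content[:6])              # raw body as bytes
print(r.text)                     # body decoded to str using r.encoding
r.raise_for_status()              # raises requests.HTTPError on 4xx/5xx
```

When a page declares no charset, r.encoding may be a guess; in that case setting r.encoding = r.apparent_encoding before reading r.text is a common workaround.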
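II. Scraping the required information from a website
1. Visit the website, as shown in Figure 1-1:
Figure 1-1
2. Click through to a subpage and inspect the page elements. Part of the markup is shown in Figure 1-2:
Figure 1-2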
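Inspecting the elements is what tells you which tags and class names to select. As a rough illustration, the sketch below runs the same kind of lookup that the script in step 3 uses (li elements with class "medium listbox group" containing h4 elements with class "heading") against a small hand-written fragment. The fragment, its href values, and the extract_links helper are assumptions made for this example; the real page contains far more markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure seen in the inspector.
SAMPLE_HTML = """
<ul>
  <li class="medium listbox group">
    <h4 class="heading"><a href="/media/Books/fandoms">Books</a></h4>
  </li>
  <li class="medium listbox group">
    <h4 class="heading"><a href="/media/Movies/fandoms">Movies</a></h4>
  </li>
</ul>
"""


def extract_links(html):
    """Collect the link text and href of every category heading."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # A class string with spaces is matched against the full class attribute.
    for li in soup.find_all("li", attrs={"class": "medium listbox group"}):
        for h4 in li.find_all("h4", attrs={"class": "heading"}):
            a = h4.find("a")
            links.append({"name": a.text, "nextpage": a["href"]})
    return links


links = extract_links(SAMPLE_HTML)
for link in links:
    print(link["name"], link["nextpage"])
```

Matching on the full class string is strict: if the site reorders or adds a class, the lookup returns nothing, so a CSS selector such as soup.select("li.listbox h4.heading a") can be a more tolerant alternative.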
3. The implementation is as follows:
#coding:utf-8
import requests
from bs4 import BeautifulSoup
import xlsxwriter


# Fetch the HTML content of a page
def GET_HTML_CONTENT(url):
    # Set a User-Agent so the request looks like it comes from a browser
    user_agent = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
    headers = {'User-Agent': user_agent}
    r = requests.get(url, headers=headers)
    # Return the page content as text
    html_str = r.text
    return html_str


# Extract the category names and subpage URLs from the index page
def GET_CHILD_URL(content):
    data = BeautifulSoup(content, "html.parser")
    genre_session = data.find_all('li', attrs={'class': "medium listbox group"})
    # childurl holds one dict per category: its name and its subpage URL
    childurl = []
    for session in genre_session:
        elements = session.find_all('h4', attrs={'class': "heading"})
        for element in elements:
            genre = {}
            genre['name'] = element.find('a').text
            genre['nextpage'] = element.find('a')['href']
            childurl.append(genre)
    return childurl


# Extract the entries listed on a category subpage
def GET_CHILD_INFO(content, kind):
    data = BeautifulSoup(content, "html.parser")
    book_session = data.find_all('ol', attrs={'class': "alphabet fandom index group "})
    items = book_session[0].find_all('ul', attrs={'class': "tags index group"})
    # books holds one dict per entry: its category, name, and review count
    books = []
    for item in items:
        book = {}
        book['kinds'] = kind
        book['name'] = item.find('a').text
        # The count is the parenthesized number on the last line of the text
        book['reviews'] = item.text.strip().split('\n')[-1].strip().strip('()')
        books.append(book)
    return books


if __name__ == '__main__':
    url = 'https://archiveofourown.org/media'
    content = GET_HTML_CONTENT(url)
    childurl = GET_CHILD_URL(content)
    row = 1
    col = 0
    # Header row: category, name, review count
    data = [[u'Category', u'Name', u'Reviews']]
    workbook = xlsxwriter.Workbook("data.xlsx")
    worksheet = workbook.add_worksheet()
    worksheet.write_row(0, 0, data[0])
    for k in childurl:
        kind = k['name']
        nexturl = k['nextpage']
        # Subpage links are relative, so prepend the site root
        geturl = 'https://archiveofourown.org' + nexturl
        txt = GET_HTML_CONTENT(geturl)
        books = GET_CHILD_INFO(txt, kind)
        for info in books:
            worksheet.write(row, col, info['kinds'])
            worksheet.write(row, col + 1, info['name'])
            worksheet.write(row, col + 2, info['reviews'])
            row += 1
    workbook.close()
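The least obvious step in the script is how the review count is pulled out of each item's text with a chain of strip and split calls. The sketch below isolates that chain on a made-up string so each step can be checked. It also shows a regular-expression variant, which is an alternative suggested here and not part of the original script. The sample text and both function names are hypothetical.

```python
import re


def count_by_split(item_text):
    """Same chain as the script: take the last line, drop the parentheses."""
    last_line = item_text.strip().split('\n')[-1]
    return last_line.strip().strip('()')


def count_by_regex(item_text):
    """Alternative: capture the digits inside the trailing parentheses."""
    match = re.search(r'\((\d+)\)\s*$', item_text)
    return match.group(1) if match else None


# Hypothetical item text: a title line followed by an indented "(count)" line.
sample = "\n  Some Fandom Title\n  (1234)\n"
print(count_by_split(sample))
print(count_by_regex(sample))

# If title and count share one line, the split-based chain keeps the title,
# while the regex still isolates the number.
one_line = "Some Fandom Title (56)"
print(count_by_split(one_line))
print(count_by_regex(one_line))
```

Both functions return strings; wrapping the result in int() before writing it to the worksheet would store the count as a number in the spreadsheet.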
4. The result of running the script is shown in Figure 1-3:
Figure 1-3