Crawling an Idiom-Chain-Style Website with Python - 創新互聯

Introduction


This article shows how to use a Python crawler to build a poetry chain game (詩歌接龍).

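The project works as follows:

Crawl poems with a crawler to build a poetry corpus;

Split the poems into lines and build a dictionary whose key is the pinyin of a line's first character and whose value is the lines that start with that pinyin, then save the dictionary as a pickle file;
Load the pickle file, write the game program, and run it as an exe file.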
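
The second and third steps can be sketched roughly as below. This is a minimal illustration rather than the author's code: the `PINYIN` table is a tiny hand-written stand-in for a real pinyin source, the punctuation used for splitting is an assumption, and the names `split_lines`, `build_index` and `poem_dict.pkl` are hypothetical.

```python
import os
import pickle
import re
import tempfile
from collections import defaultdict

# Hypothetical toy lookup: character -> pinyin with tone number.
# A real project would need a complete pinyin source; this is for illustration only.
PINYIN = {"床": "chuang2", "疑": "yi2", "舉": "ju3", "低": "di1"}


def split_lines(poem):
    """Split a poem into lines on common Chinese punctuation (an assumed rule)."""
    parts = re.split(r"[，。！？；]", poem)
    return [p for p in parts if p]


def build_index(poems):
    """Map the pinyin of each line's first character to the lines starting with it."""
    index = defaultdict(list)
    for poem in poems:
        for line in split_lines(poem):
            key = PINYIN.get(line[0])
            if key is not None:  # skip characters missing from the toy table
                index[key].append(line)
    return dict(index)


poems = ["床前明月光，疑是地上霜。舉頭望明月，低頭思故鄉。"]
index = build_index(poems)

# Save the dictionary as a pickle file, then read it back.
path = os.path.join(tempfile.gettempdir(), "poem_dict.pkl")
with open(path, "wb") as f:
    pickle.dump(index, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

Storing a list of lines under each key means the game can later pick any line that starts with the required sound.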

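In this project's version of the game, the first character of the next line must have the same pinyin, tone included, as the last character of the previous line. The sections below walk through the implementation step by step.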
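
The matching rule can be expressed in a few lines. The sketch below is an assumption about how such a check might look, not the original program: `PINYIN` is a hypothetical toy table with tone numbers, `can_follow` is an illustrative name, and the sample strings are chosen only to exercise the rule.

```python
# Hypothetical toy table: character -> pinyin with tone number (illustration only).
PINYIN = {"霜": "shuang1", "雙": "shuang1", "爽": "shuang3"}


def can_follow(prev_line, next_line):
    """Return True if next_line may follow prev_line under the chain rule."""
    last = PINYIN.get(prev_line[-1])
    first = PINYIN.get(next_line[0])
    # Both characters must be known and share the same syllable and tone.
    return last is not None and last == first


same_tone = can_follow("疑是地上霜", "雙飛燕子幾時回")
other_tone = can_follow("疑是地上霜", "爽")
```

Because the comparison is on pinyin rather than on the character itself, homophones such as 霜 and 雙 link up, while a syllable with a different tone (爽) does not.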

Poetry corpus

First, we use a Python crawler to fetch poems and build the corpus. The site crawled is https://www.gushiwen.org.

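Because this article is mainly meant to demonstrate the approach, it crawls only four collections from that page: Three Hundred Tang Poems (唐詩三百首), Three Hundred Ancient Poems (古詩三百), Three Hundred Song Ci (宋詞三百) and Selected Song Ci (宋詞精選), a little over 1,100 poems in total. To speed up the crawl, the requests run concurrently and the results are saved to poem.txt. The complete Python program is as follows: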

import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

# Index pages of the poem collections to crawl
urls = ['https://so.gushiwen.org/gushi/tangshi.aspx',
        'https://so.gushiwen.org/gushi/sanbai.aspx',
        'https://so.gushiwen.org/gushi/songsan.aspx',
        'https://so.gushiwen.org/gushi/songci.aspx'
        ]

poem_links = []
# Collect the URL of every poem page
for url in urls:
    # Request headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    req = requests.get(url, headers=headers)

    soup = BeautifulSoup(req.text, "lxml")
    # The first div with class "sons" holds the list of poem links
    content = soup.find_all('div', class_="sons")[0]
    links = content.find_all('a')

    for link in links:
        poem_links.append('https://so.gushiwen.org'+link['href'])

poem_list = []
# Crawl a single poem page
def get_poem(url):
    # Example: url = 'https://so.gushiwen.org/shiwenv_45c396367f59.aspx'
    # Request headers
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, "lxml")
    node = soup.find('div', class_='contson')
    if node is None:
        # Skip pages that have no poem body instead of raising inside the thread
        return
    poem = node.text.strip()
    poem = poem.replace(' ', '')
    # Remove notes in ASCII parentheses, then notes in full-width parentheses
    poem = re.sub(r"\([\s\S]*?\)", '', poem)
    poem = re.sub(r"（[\s\S]*?）", '', poem)
    # Remove notes that follow a full stop, open with "(" and close with "）"
    poem = re.sub(r"。\([\s\S]*?）", '', poem)
    # Normalise ASCII punctuation to full-width
    poem = poem.replace('!', '！').replace('?', '？')
    poem_list.append(poem)

# Crawl concurrently
executor = ThreadPoolExecutor(max_workers=10) # max_workers is the number of threads; adjust as needed
# submit() takes the function first, followed by any number of arguments for it
future_tasks = [executor.submit(get_poem, url) for url in poem_links]
# Wait for every thread to finish before continuing
wait(future_tasks, return_when=ALL_COMPLETED)

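# Write the crawled poems to a txt file
poems = list(set(poem_list))  # remove duplicates
poems = sorted(poems, key=lambda x:len(x))  # shortest poems first
# Open the file once; an explicit encoding avoids platform-dependent defaults
with open('F://poem.txt', 'a', encoding='utf-8') as f:
    for poem in poems:
        poem = poem.replace('《','').replace('》','') \
                   .replace('：', '').replace('“', '')
        print(poem)
        f.write(poem)
        f.write('\n')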
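
One point about the concurrent step: `wait()` does not re-raise exceptions from worker threads, so a failed page goes unnoticed unless each future's `result()` is checked. The sketch below shows one possible alternative in which each worker returns its cleaned text and `executor.map()` collects the results. It is an illustration under stated assumptions: `clean_poem`, `fetch_poem`, `fake_fetch` and `FAKE_PAGES` are hypothetical names, and the network call is replaced by a stand-in so the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def clean_poem(text):
    """Remove spaces and parenthesised notes, then normalise ! and ? to full-width."""
    text = text.strip().replace(' ', '')
    text = re.sub(r"\([\s\S]*?\)", '', text)   # notes in ASCII parentheses
    text = re.sub(r"（[\s\S]*?）", '', text)    # notes in full-width parentheses
    return text.replace('!', '！').replace('?', '？')


def fetch_poem(url, fetch):
    """Fetch one page with the supplied callable and return the cleaned text."""
    return clean_poem(fetch(url))


# Hypothetical stand-in for the network call, so the sketch needs no connection.
FAKE_PAGES = {
    "page-a": " 床前明月光，疑是地上霜。(注釋) ",
    "page-b": "舉頭望明月，低頭思故鄉!",
}


def fake_fetch(url):
    return FAKE_PAGES[url]


with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order and re-raises any worker exception.
    results = list(executor.map(lambda u: fetch_poem(u, fake_fetch), FAKE_PAGES))
```

Returning values from the workers also removes the need for a shared global list, and the `with` block shuts the pool down once all tasks have finished.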

Article URL: http://weahome.cn/article/dosopc.html
