A Crawler in 85 Lines of Code: Multithreading, Data File Output, and Database Storage

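Preface

This is the second crawler I have written since I started working with web scraping.
It is also my second real project since I started learning Python; the first project was my first crawler.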


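I have been learning Python for a little under three weeks. My regular course load is heavy and Python is something I study on the side, so I only spend about an hour a day on it.
My understanding of Python is therefore still limited. If there are mistakes in the code or in my explanations, corrections are welcome.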

The Target Site

This is a site that recommends web novels.
https://www.tuishujun.com/

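I previously used the code below to crawl the data for every novel on this site, roughly 170,000 books.
It took about 6 hours, which is reasonably efficient. With a single thread, I estimate it would take several days even if the crawler ran around the clock.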
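
As a rough check on these figures, the sketch below turns them into a throughput number. The single-threaded duration of 3 days is an assumption chosen only to illustrate "several days"; it was not measured.

```python
def pages_per_second(pages, hours):
    """Average throughput for a crawl of `pages` pages that took `hours` hours."""
    return pages / (hours * 3600)


def speedup(single_thread_days, pool_hours):
    """How many times faster the pooled run is than the single-threaded estimate."""
    return single_thread_days * 24 / pool_hours


# 170,000 pages in 6 hours, the figures reported above.
rate = pages_per_second(170_000, 6)
print(f"{rate:.2f} pages/s")      # 7.87 pages/s

# Assumed single-threaded duration: 3 days (illustrative, not measured).
print(f"{speedup(3, 6):.0f}x")    # 12x
```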

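When I first started writing this crawler I ran into many problems. The first was that, although there is plenty of material online about multithreaded crawlers in Python, ...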

Beyond that, examples of crawlers that write to a database from multiple threads are also fairly rare.

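To solve these problems I went through a lot of material, took quite a few wrong turns, and spent several days of trial and error before arriving at the example below.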

You can use the example below as a starting point and extend it into your own multithreaded crawler.

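Points to note:
The example builds its thread pool with ThreadPoolExecutor (it is worth reading up on it). If you want to store data in a database while using multiple threads, I recommend this approach; otherwise you are likely to run into all sorts of errors when the code runs.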
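
The pattern described here can be sketched in isolation: a ThreadPoolExecutor runs the tasks, and every task opens and closes its own database connection instead of sharing one across threads. To keep the sketch runnable without a MySQL server, it uses the standard-library sqlite3 module as a stand-in for pymysql; the table `book`, the helper names, and the temporary database path are all hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical throwaway database file, used only for this sketch.
DB_PATH = os.path.join(tempfile.mkdtemp(), "demo.db")


def init_db():
    """Create the table once, before any worker thread starts."""
    con = sqlite3.connect(DB_PATH)
    try:
        con.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, name TEXT)")
        con.commit()
    finally:
        con.close()


def save_page(page):
    """Runs in a worker thread: open a private connection, insert, close."""
    con = sqlite3.connect(DB_PATH, timeout=30)  # wait if another writer holds the lock
    try:
        con.execute("INSERT INTO book (id, name) VALUES (?, ?)", (page, f"book-{page}"))
        con.commit()
    finally:
        con.close()  # closed even if the insert fails
    return page


def crawl(pages, workers=4):
    init_db()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() consumes the results, so an exception in any task is raised here.
        return list(pool.map(save_page, pages))


def count_rows():
    con = sqlite3.connect(DB_PATH)
    try:
        return con.execute("SELECT COUNT(*) FROM book").fetchone()[0]
    finally:
        con.close()
```

Calling `crawl(range(1, 21))` followed by `count_rows()` should report one row per page. With pymysql the shape stays the same: connect inside the task, commit, and close in a `finally` block.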

Code Example

import requests
import pymysql
import os
from lxml import etree
from fake_useragent import UserAgent
from concurrent.futures import ThreadPoolExecutor


class tuishujunSpider(object):
    def __init__(self):
        # Create the output directory if it does not exist yet.
        os.makedirs('db/tuishujun', exist_ok=True)
        # Text file that every worker thread appends to.
        self.f = open('./db/tuishujun/tuishujun.txt', 'a', encoding='utf-8')
        # Short-lived connection, used only to make sure the table exists.
        con = pymysql.connect(host='localhost', user='root', password='', database='novel',
                              charset='utf8', port=3306)
        try:
            with con.cursor() as cursor:
                cursor.execute("""CREATE TABLE IF NOT EXISTS tuishujun
                            ( id BIGINT NOT NULL AUTO_INCREMENT,
                              cover VARCHAR(255),
                              name VARCHAR(255),
                              author VARCHAR(255),
                              source VARCHAR(255),
                              intro LONGTEXT,
                              PRIMARY KEY (id))
                           """)
            con.commit()
        finally:
            con.close()

    def start(self, page):
        # Fetch one book page, parse it, and store the result.
        # Runs inside a worker thread, once per page number.
        headers = {
            'User-Agent': UserAgent().random
        }
        url = 'https://www.tuishujun.com/books/' + str(page)
        r = requests.get(url, headers=headers)
        # Skip page numbers the site answers with a server error.
        if r.status_code == 500:
            return
        html = etree.HTML(r.text)
        # Common prefix of every XPath used below.
        base = '//*[@id="__layout"]/div/div[2]/div/div[1]/div/div[1]'
        book = {}
        book['id'] = str(page)
        # The cover image is optional; fall back to an empty string.
        try:
            cover = html.xpath(base + '/div[1]/div[1]/img/@src')[0]
        except IndexError:
            cover = ''
        book['cover'] = cover
        book['name'] = html.xpath(base + '/div[1]/div[2]/div/div[1]/h3/text()')[0]
        author = html.xpath(base + '/div[1]/div[2]/div/div[2]/a/text()')[0].strip()
        book['author'] = author.replace("\n", "")
        book['source'] = html.xpath(base + '/div[1]/div[2]/div/div[5]/text()')[0]
        intro = html.xpath(base + '/div[2]/text()')[0]
        book['intro'] = intro.replace(" ", "").replace("\n", "")
        self.f.write(str(book) + '\n')
        # Each call opens its own connection, because a pymysql connection
        # should not be shared between threads. Opening it only after parsing
        # and closing it in `finally` means no connection is left open when
        # the page is skipped or the insert fails.
        con = pymysql.connect(
            host='localhost', user='root', password='', database='novel', charset='utf8', port=3306)
        try:
            with con.cursor() as cursor:
                cursor.execute("insert into tuishujun(id,cover,name,author,source,intro) "
                               "values(%s,%s,%s,%s,%s,%s)",
                               (book['id'], book['cover'], book['name'], book['author'],
                                book['source'], book['intro']))
            con.commit()
        finally:
            con.close()
        print(book)

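    def run(self):
        # The upper bound of the page range is left blank in this listing.
        # As written, range(1, ) is the same as range(1) and yields only 0,
        # so fill in one past the last book id you want to crawl.
        pages = range(1, )
        # Leaving the `with` block waits for every task to finish. The results
        # of pool.map are never read, so exceptions raised inside start() are
        # dropped silently.
        with ThreadPoolExecutor() as pool:
            pool.map(self.start, pages)
        # All workers are done; flush and close the shared output file.
        self.f.close()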


if __name__ == '__main__':
    spider = tuishujunSpider()
    spider.run()
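
Two details of the listing are easy to miss: `pool.map` discards exceptions when its results are never read, and all threads write to the same file object. The sketch below is one possible way to handle both, not the approach used above. It submits tasks individually, records which pages failed, and guards the file write with a lock. The `fetch` function is a hypothetical stand-in for the request-and-parse step and fails on purpose for every fifth page.

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(page):
    # Hypothetical stand-in for requesting and parsing one page.
    if page % 5 == 0:
        raise ValueError(f"page {page} has no title")
    return {"id": page, "name": f"book-{page}"}


def crawl(pages, out, workers=4):
    """Write one line per successful page to `out`; return {page: exception}."""
    write_lock = threading.Lock()
    failed = {}

    def task(page):
        book = fetch(page)
        with write_lock:  # only one thread writes at a time
            out.write(str(book) + "\n")
        return page

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(task, p): p for p in pages}
        for fut in as_completed(futures):
            page = futures[fut]
            try:
                fut.result()  # re-raises whatever the task raised
            except Exception as exc:
                failed[page] = exc  # filled in by the main thread only
    return failed


if __name__ == "__main__":
    buffer = io.StringIO()
    errors = crawl(range(1, 11), buffer)
    print(sorted(errors))                     # [5, 10]
    print(len(buffer.getvalue().splitlines()))  # 8
```

Collecting failures this way makes it possible to retry the affected pages later instead of losing them without a trace.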

Article source: http://weahome.cn/article/dsogghj.html
