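Chinese New Year is almost here. What is everyone busy with? Toward the end of the year the whole office is scrambling for train tickets and stocking up on New Year goods, and with the holiday mood in the air everyone just wants to get home. Who still has the mind to work? Homesick = low output = ten bugs per line of code = boredom. Then I remembered the Python I had studied for a while. I also like watching movies, and clicking into each movie's detail page and downloading them one by one is tedious, so why not write a Python tool that downloads movies automatically? Hey, suddenly I wasn't bored anymore. Back before there were so many XX memberships, I used to go to XX Heaven to find movie resources, and it still has most of the movies I want to watch. That's the one. Let's crawl it!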
Back when I was playing with Python I crawled quite a few websites, all from the office (Python isn't part of the company's business; I was purely tinkering for fun). My colleague in ops would come over every day and say: "What are you crawling now? Go read the news, so-and-so just got arrested for scraping again! If something happens, it's on you!" Good grief, that scared me out of carrying on. This post crawls resources from a certain "Heaven" site (exactly which one shows up in the code below). Will I get arrested? It's purely a technical discussion and personal practice, with no commercial use, so it should be fine, right? My hands are trembling a little as I write this...
All right, so be it. If I don't go to hell, who will? First, the final result:
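As shown above, this download tool has a GUI (pretty cool, right?). Enter a root URL and a minimum movie rating and it crawls movies automatically. Building this tool requires the following knowledge: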
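That's about it. As for the implementation details, there aren't many: using requests + BeautifulSoup, regular expressions with re, Python data types, Python threads, persistence libraries such as dbm and pickle, and so on. That is the whole scope of this tool. Of course, Python is object-oriented, and a programming mindset carries across all languages; it isn't built in a day and can't really be put into words. Check yourself against the list and read up on whatever you're missing, because I'm just going to paste the code.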
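The persist_util module that the tool imports is not shown in this section, so here is a minimal sketch of what pickle-based persistence of the download history might look like. The function names mirror the load_history_links / save_history_links calls used by the tool, but the `path` parameter, the use of a set, and the write-then-rename step are assumptions for illustration, not the original implementation.

```python
import os
import pickle
import tempfile


def save_history_links(links, path):
    """Pickle the download links to `path` (hypothetical parameter).

    The data is written to a temporary file first and then renamed, so a
    crash in the middle of a save does not leave a half-written history file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(set(links), f)
    os.replace(tmp_path, path)


def load_history_links(path):
    """Return the saved set of links, or an empty set if nothing was saved yet."""
    if not os.path.exists(path):
        return set()
    with open(path, "rb") as f:
        return pickle.load(f)
```

Storing a set keeps the "has this link been downloaded already?" check cheap and drops duplicates automatically. Only unpickle files the tool wrote itself, since pickle can execute arbitrary code when loading untrusted data.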
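While on the subject of learning Python, a few more words: when I was learning to write crawlers in Python I followed the blog of @工匠若水 (https://blog.csdn.net/yanbober). His Python articles are really good and very helpful for people who have programming experience but have never touched Python; you can get a small project going quite quickly. All right, on to the code: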
import os
from threading import Thread
from tkinter import *

import html_download
import html_parser
import persist_util
import url_manager


class SpiderMain(object):
    def __init__(self):
        self.mUrlManager = url_manager.UrlManager()
        self.mHtmlParser = html_parser.HtmlParser()
        self.mHtmlDownload = html_download.HtmlDownload()
        self.mPersist = persist_util.PersistUtil()

    # Load the links downloaded in previous runs
    def load_history(self):
        history_download_links = self.mPersist.load_history_links()
        if history_download_links is not None and len(history_download_links) > 0:
            for download_link in history_download_links:
                self.mUrlManager.add_download_url(download_link)
                d_log("Loaded history download link: " + download_link)

    # Save the downloaded links
    def save_history(self):
        history_download_links = self.mUrlManager.get_download_url()
        if history_download_links is not None and len(history_download_links) > 0:
            self.mPersist.save_history_links(history_download_links)

    def craw_movie_links(self, root_url, score=8):
        count = 0
        self.mUrlManager.add_url(root_url)
        while self.mUrlManager.has_continue():
            try:
                count = count + 1
                url = self.mUrlManager.get_url()
                d_log("craw %d : %s" % (count, url))
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
                    'Referer': url
                }
                content = self.mHtmlDownload.down_html(url, retry_count=3, headers=headers)
                if content is not None:
                    doc = content.decode('gb2312', 'ignore')
                    movie_urls, next_link = self.mHtmlParser.parser_movie_link(doc)
                    if movie_urls is not None and len(movie_urls) > 0:
                        for movie_url in movie_urls:
                            d_log('movie info url: ' + movie_url)
                            content = self.mHtmlDownload.down_html(movie_url, retry_count=3, headers=headers)
                            if content is not None:
                                doc = content.decode('gb2312', 'ignore')
                                movie_name, movie_score, movie_xunlei_links = self.mHtmlParser.parser_movie_info(doc, score=score)
                                if movie_xunlei_links is not None and len(movie_xunlei_links) > 0:
                                    for xunlei_link in movie_xunlei_links:
                                        # Check whether this movie has already been downloaded
                                        is_download = self.mUrlManager.has_download(xunlei_link)
                                        if not is_download:
                                            # Add movies that have not been downloaded yet to the Xunlei (Thunder) download list
                                            d_log('Start downloading ' + movie_name + ', link: ' + xunlei_link)
                                            self.mUrlManager.add_download_url(xunlei_link)
                                            os.system(r'"D:\迅雷\Thunder\Program\Thunder.exe" {url}'.format(url=xunlei_link))
                                            # Update the database after every download, so that a movie is not
                                            # downloaded again even if the program exits abnormally
                                            self.save_history()
                    if next_link is not None:
                        d_log('next link: ' + next_link)
                        self.mUrlManager.add_url(next_link)
            except Exception as e:
                d_log('Error: ' + str(e))


def runner(rootLink=None, scoreLimit=None):
    if rootLink is None:
        return
    spider = SpiderMain()
    spider.load_history()
    if scoreLimit is None:
        spider.craw_movie_links(rootLink)
    else:
        spider.craw_movie_links(rootLink, score=float(scoreLimit))
    spider.save_history()


# rootLink = 'https://www.dytt8.net/html/gndy/dyzz/index.html'
# rootLink = 'https://www.dytt8.net/html/gndy/dyzz/list_23_207.html'
def start(rootLink, scoreLimit):
    loop_thread = Thread(target=runner, args=(rootLink, scoreLimit,), name='LOOP THREAD')
    # loop_thread.setDaemon(True)
    loop_thread.start()
    # loop_thread.join()  # Do not make the main thread wait, or the GUI will freeze
    btn_start.configure(state='disabled')


# Refresh the GUI with a scrolling-text effect
def d_log(log):
    s = log + '\n'
    txt.insert(END, s)
    txt.see(END)


if __name__ == "__main__":
    rootGUI = Tk()
    rootGUI.title('XX Movie Auto-Download Tool')
    # Set the window background color
    black_background = '#000000'
    rootGUI.configure(background=black_background)
    # Get the screen width and height
    screen_w, screen_h = rootGUI.maxsize()
    # Center the window on the screen
    window_x = (screen_w - 640) / 2
    window_y = (screen_h - 480) / 2
    window_xy = '640x480+%d+%d' % (window_x, window_y)
    rootGUI.geometry(window_xy)

    lable_link = Label(rootGUI, text='Root URL to parse: ',
                       bg='black',
                       fg='red',
                       font=('宋體', 12),
                       relief=FLAT)
    lable_link.place(x=20, y=20)
    lable_link_width = lable_link.winfo_reqwidth()
    lable_link_height = lable_link.winfo_reqheight()
    input_link = Entry(rootGUI)
    input_link.place(x=20+lable_link_width, y=20, relwidth=0.5)

    lable_score = Label(rootGUI, text='Minimum movie rating: ',
                        bg='black',
                        fg='red',
                        font=('宋體', 12),
                        relief=FLAT)
    lable_score.place(x=20, y=20+lable_link_height+10)
    input_score = Entry(rootGUI)
    input_score.place(x=20+lable_link_width, y=20+lable_link_height+10, relwidth=0.3)

    btn_start = Button(rootGUI, text='Start download', command=lambda: start(input_link.get(), input_score.get()))
    btn_start.place(relx=0.4, rely=0.2, relwidth=0.1, relheight=0.1)

    txt = Text(rootGUI)
    txt.place(rely=0.4, relwidth=1, relheight=0.5)

    rootGUI.mainloop()
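One thing to watch in the code above: d_log is called from the crawler thread, but it writes straight into the Text widget, and tkinter widgets are generally only safe to touch from the thread running mainloop. A common workaround is to push log messages onto a queue and let the main thread poll it with after(). The following is a minimal sketch of that idea, not the original tool's code; `drain_log`, `attach_to_text`, and `interval_ms` are hypothetical names introduced for illustration.

```python
import queue
import threading

# Messages produced by worker threads wait here until the GUI thread picks them up.
log_queue = queue.Queue()


def d_log(message):
    """Safe to call from any thread: it only enqueues the message."""
    log_queue.put(message)


def drain_log(write):
    """Pass every queued message to `write`; return how many were handled."""
    handled = 0
    while True:
        try:
            message = log_queue.get_nowait()
        except queue.Empty:
            return handled
        write(message + "\n")
        handled += 1


def attach_to_text(root, txt, interval_ms=100):
    """Poll the queue from the Tk main loop and append messages to `txt`.

    `root` is expected to be a tkinter.Tk and `txt` a tkinter.Text.
    Call this once from the main thread, before root.mainloop().
    """
    def poll():
        drain_log(lambda s: txt.insert("end", s))
        txt.see("end")  # keep the newest line visible
        root.after(interval_ms, poll)  # schedule the next poll

    poll()
```

With this in place, the main block would call attach_to_text(rootGUI, txt) just before rootGUI.mainloop(), and the crawler thread keeps calling d_log exactly as before.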