平時(shí)當(dāng)我們需要爬取一些我們需要的數(shù)據(jù)時(shí),總是有些網(wǎng)站禁止同一IP重復(fù)訪問(wèn),這時(shí)候我們就應(yīng)該使用代理IP,每次訪問(wèn)前偽裝自己,讓“敵人”無(wú)法察覺(jué)。
成都創(chuàng)新互聯(lián)公司網(wǎng)站建設(shè)公司一直秉承“誠(chéng)信做人,踏實(shí)做事”的原則,不欺瞞客戶,是我們最起碼的底線! 以服務(wù)為基礎(chǔ),以質(zhì)量求生存,以技術(shù)求發(fā)展,成交一個(gè)客戶多一個(gè)朋友!專(zhuān)注中小微企業(yè)官網(wǎng)定制,成都網(wǎng)站設(shè)計(jì)、網(wǎng)站制作、外貿(mào)營(yíng)銷(xiāo)網(wǎng)站建設(shè),塑造企業(yè)網(wǎng)絡(luò)形象打造互聯(lián)網(wǎng)企業(yè)效應(yīng)。oooooooooooooooOK,讓我們愉快的開(kāi)始吧!
這個(gè)是獲取代理ip的文件,我將它們模塊化,分為三個(gè)函數(shù)
注:文中會(huì)有些英文注釋?zhuān)菫榱藢?xiě)代碼方便,畢竟英文一兩個(gè)單詞就ok了
#!/usr/bin/python #-*- coding:utf-8 -*- """ author:dasuda """ import urllib2 import re import socket import threading findIP = [] #獲取的原始IP數(shù)據(jù) IP_data = [] #拼接端口后的IP數(shù)據(jù) IP_data_checked = [] #檢查可用性后的IP數(shù)據(jù) findPORT = [] #IP對(duì)應(yīng)的端口 available_table = [] #可用IP的索引 def getIP(url_target): patternIP = re.compile(r'(?<=)[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}') patternPORT = re.compile(r'(?<= )[\d]{2,5}(?= )') print "now,start to refresh proxy IP..." for page in range(1,4): url = 'http://www.xicidaili.com/nn/'+str(page) headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"} request = urllib2.Request(url=url, headers=headers) response = urllib2.urlopen(request) content = response.read() findIP = re.findall(patternIP,str(content)) findPORT = re.findall(patternPORT,str(content)) #assemble the ip and port for i in range(len(findIP)): findIP[i] = findIP[i] + ":" + findPORT[i] IP_data.extend(findIP) print('get page', page) print "refresh done!!!" #use multithreading mul_thread_check(url_target) return IP_data_checked def check_one(url_check,i): #get lock lock = threading.Lock() #setting timeout socket.setdefaulttimeout(8) try: ppp = {"http":IP_data[i]} proxy_support = urllib2.ProxyHandler(ppp) openercheck = urllib2.build_opener(proxy_support) urllib2.install_opener(openercheck) request = urllib2.Request(url_check) request.add_header('User-Agent',"Mozilla/5.0 (Windows NT 10.0; WOW64)") html = urllib2.urlopen(request).read() lock.acquire() print(IP_data[i],'is OK') #get available ip index available_table.append(i) lock.release() except Exception as e: lock.acquire() print('error') lock.release() def mul_thread_check(url_mul_check): threads = [] for i in range(len(IP_data)): #creat thread... thread = threading.Thread(target=check_one, args=[url_mul_check,i,]) threads.append(thread) thread.start() print "new thread start",i for thread in threads: thread.join() #get the IP_data_checked[] for error_cnt in range(len(available_table)): aseemble_ip = {'http': IP_data[available_table[error_cnt]]} IP_data_checked.append(aseemble_ip) print "available proxy ip:",len(available_table)