#python中的pandas庫主要有DataFrame和Series類(面向?qū)ο蟮牡恼Z言更愿意叫類) DataFrame也就是
創(chuàng)新互聯(lián)建站-專業(yè)網(wǎng)站定制、快速模板網(wǎng)站建設(shè)、高性價(jià)比沙灣網(wǎng)站開發(fā)、企業(yè)建站全套包干低至880元,成熟完善的模板庫,直接使用。一站式沙灣網(wǎng)站制作公司更省心,省錢,快速模板網(wǎng)站建設(shè)找我們,業(yè)務(wù)覆蓋沙灣地區(qū)。費(fèi)用合理售后完善,10余年實(shí)體公司更值得信賴。
#數(shù)據(jù)框(主要是借鑒R里面的data.frame),Series也就是序列 ,pandas底層是c寫的 性能很棒,有大神
#做過測(cè)試 處理億級(jí)別的數(shù)據(jù)沒問題,起性能可以跟同等配置的sas媲美
#DataFrame索引 df.loc是標(biāo)簽選取操作,df.iloc是位置切片操作
print(df[['row_names','Rape']])
df['行標(biāo)簽']
df.loc[行標(biāo)簽,列標(biāo)簽]
print(df.loc[0:2,['Rape','Murder']])
df.iloc[行位置,列位置]
df.iloc[1,1]#選取第二行,第二列的值,返回的為單個(gè)值
df.iloc[0,2],:]#選取第一行及第三行的數(shù)據(jù)
df.iloc[0:2,:]#選取第一行到第三行(不包含)的數(shù)據(jù)
df.iloc[:,1]#選取所有記錄的第一列的值,返回的為一個(gè)Series
df.iloc[1,:]#選取第一行數(shù)據(jù),返回的為一個(gè)Series
print(df.ix[1,1]) # 更廣義的切片方式是使用.ix,它自動(dòng)根據(jù)你給到的索引類型判斷是使用位置還是標(biāo)簽進(jìn)行切片
print(df.ix[0:2])
#DataFrame根據(jù)條件選取子集 類似于sas里面if、where ,R里面的subset之類的函數(shù)
df[df.Murder13]
df[(df.Murder10)(df.Rape30)]
df[df.sex==u'男']
#重命名 相當(dāng)于sas里面的rename R軟件中reshape包的中的rename
df.rename(columns={'A':'A_rename'})
df.rename(index={1:'other'})
#刪除列 相當(dāng)于sas中的drop R軟件中的test['col']-null
df.drop(['a','b'],axis=1) or del df[['a','b']]
#排序 相當(dāng)于sas里面的sort R軟件里面的df[order(x),]
df.sort(columns='C') #行排序 y軸上
df.sort(axis=1) #各個(gè)列之間位置排序 x軸上
#數(shù)據(jù)描述 相當(dāng)于sas中proc menas R軟件里面的summary
df.describe()
#生成新的一列 跟R里面有點(diǎn)類似
df['new_columns']=df['columns']
df.insert(1,'new_columns',df['B']) #效率最高
df.join(Series(df['columns'],name='new_columns'))
#列上面的追加 相當(dāng)于sas中的append R里面cbind()
df.append(df1,ignore_index=True)
pd.concat([df,df1],ignore_index=True)
#最經(jīng)典的join 跟sas和R里面的merge類似 跟sql里面的各種join對(duì)照
merge()
#刪除重行 跟sas里面nodukey R里面的which(!duplicated(df[])類似
df.drop_duplicated()
#獲取最大值 最小值的位置 有點(diǎn)類似矩陣?yán)锩娴姆椒?/p>
df.idxmin(axis=0 ) df.idxmax(axis=1) 0和1有什么不同 自己摸索去
#讀取外部數(shù)據(jù)跟sas的proc import R里面的read.csv等類似
read_excel() read_csv() read_hdf5() 等
與之相反的是df.to_excel() df.to_ecv()
#缺失值處理 個(gè)人覺得pandas中缺失值處理比sas和R方便多了
df.fillna(9999) #用9999填充
#鏈接數(shù)據(jù)庫 不多說 pandas里面主要用 MySQLdb
import MySQLdb
conn=MySQLdb.connect(host="localhost",user="root",passwd="",db="mysql",use_unicode=True,charset="utf8")
read_sql() #很經(jīng)典
#寫數(shù)據(jù)進(jìn)數(shù)據(jù)庫
df.to_sql('hbase_visit',con, flavor="mysql", if_exists='replace', index=False)
#groupby 跟sas里面的中的by R軟件中dplyr包中的group_by sql里面的group by功能是一樣的 這里不多說
#求啞變量
dumiper=pd.get_dummies(df['key'])
df['key'].join(dumpier)
#透視表 和交叉表 跟sas里面的proc freq步類似 R里面的aggrate和cast函數(shù)類似
pd.pivot_table()
pd.crosstab()
#聚合函數(shù)經(jīng)常跟group by一起組合用
df.groupby('sex').agg({'height':['mean','sum'],'weight':['count','min']})
#數(shù)據(jù)查詢過濾
test.query("0.2
將STK_ID中的值過濾出來
stk_list = ['600809','600141','600329']中的全部記錄過濾出來,命令是:rpt[rpt['STK_ID'].isin(stk_list)].
將dataframe中,某列進(jìn)行清洗的命令
刪除換行符:misc['product_desc'] = misc['product_desc'].str.replace('\n', '')
刪除字符串前后空格:df["Make"] = df["Make"].map(str.strip)
如果用模糊匹配的話,命令是:
rpt[rpt['STK_ID'].str.contains(r'^600[0-9]{3}$')]
對(duì)dataframe中元素,進(jìn)行類型轉(zhuǎn)換
df['2nd'] = df['2nd'].str.replace(',','').astype(int) df['CTR'] = df['CTR'].str.replace('%','').astype(np.float64)
#時(shí)間變換 主要依賴于datemie 和time兩個(gè)包
#其他的一些技巧
df2[df2['A'].map(lambda x:x.startswith('61'))] #篩選出以61開頭的數(shù)據(jù)
df2["Author"].str.replace(".+", "").head() #replace(".+", "")表示將字符串中以””開頭;以””結(jié)束的任意子串替換為空字符串
commits = df2["Name"].head(15)
print commits.unique(), len(commits.unique()) #獲的NAME的不同個(gè)數(shù),類似于sql里面count(distinct name)
#pandas中最核心 最經(jīng)典的函數(shù)apply map applymap
這個(gè)里面他的話這個(gè)是可以進(jìn)行進(jìn)行采樣的方法,然后再從他的點(diǎn)訓(xùn)練數(shù)據(jù)中轉(zhuǎn)換乘1000點(diǎn)的話,它都是里面是轉(zhuǎn)化的,數(shù)據(jù)比較多,所以所以的話工程量比較大。
Python-for-data-重新采樣和頻率轉(zhuǎn)換
重新采樣指的是將時(shí)間序列從一個(gè)頻率轉(zhuǎn)換到另一個(gè)頻率的過程。
但是也并不是所有的采樣方式都是屬于上面的兩種
pandas中使用resample方法來實(shí)現(xiàn)頻率轉(zhuǎn)換,下面是resample方法的參數(shù)詳解:
將數(shù)據(jù)聚合到一個(gè)規(guī)則的低頻上,例如將時(shí)間轉(zhuǎn)換為每個(gè)月,"M"或者"BM",將數(shù)據(jù)分成一個(gè)月的時(shí)間間隔。
每個(gè)間隔是半閉合的,一個(gè)數(shù)據(jù)只能屬于一個(gè)時(shí)間間隔。時(shí)間間隔的并集必須是整個(gè)時(shí)間幀
默認(rèn)情況下,左箱體邊界是包含的。00:00的值是00:00到00:05間隔內(nèi)的值
產(chǎn)生的時(shí)間序列按照每個(gè)箱體左邊的時(shí)間戳被標(biāo)記。
傳遞span class="mark"label="right"/span可以使用右箱體邊界標(biāo)記時(shí)間序列
向loffset參數(shù)傳遞字符串或者日期偏置
在金融數(shù)據(jù)中,為每個(gè)數(shù)據(jù)桶計(jì)算4個(gè)值是常見的問題:
通過span class="girk"ohlc聚合函數(shù)/span能夠得到四種聚合值列的DF數(shù)據(jù)
低頻轉(zhuǎn)到高頻的時(shí)候會(huì)形成缺失值
ffill() :使用前面的值填充, limit 限制填充的次數(shù)
奧巴馬的競(jìng)選資金是一點(diǎn)點(diǎn)從選民那里募集來的。如獲黨內(nèi)提名,可得政府拔款,但也沒多少。美國大選不僅禁外國人捐款,而且禁止公司機(jī)構(gòu)捐款,而只允許個(gè)人捐款。不僅如此,還為個(gè)人捐款限制了上限,防止富人捐過多的款而影響未來的公平執(zhí)政。
不僅富人自己不能多捐,如果某個(gè)老板呼吁自己的員工給某人捐錢或投票支持他,都是犯法的。因此,想要籌到幾千萬競(jìng)爭(zhēng)資金,唯一的辦法是爭(zhēng)取更多選民支持,一點(diǎn)點(diǎn)募集。所以,中國、公司、大筆捐款,這三條都是犯法的。
我記得以前已經(jīng)有華人鬧過這種丑聞了。美國的選舉法就是要嚴(yán)防少數(shù)人企圖用幾個(gè)臭錢影響美國的政治。所以我們作為外國人就更別去自討沒趣了。
導(dǎo)入包
In [1]:
import numpy as npimport pandas as pdfrom pandas import Series,DataFrame
方便大家操作,將月份和參選人以及所在政黨進(jìn)行定義
In [2]:
months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6, 'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}of_interest = ['Obama, Barack', 'Romney, Mitt', 'Santorum, Rick', 'Paul, Ron', 'Gingrich, Newt']parties = { 'Bachmann, Michelle': 'Republican', 'Romney, Mitt': 'Republican', 'Obama, Barack': 'Democrat', "Roemer, Charles E. 'Buddy' III": 'Reform', 'Pawlenty, Timothy': 'Republican', 'Johnson, Gary Earl': 'Libertarian', 'Paul, Ron': 'Republican', 'Santorum, Rick': 'Republican', 'Cain, Herman': 'Republican', 'Gingrich, Newt': 'Republican', 'McCotter, Thaddeus G': 'Republican', 'Huntsman, Jon': 'Republican', 'Perry, Rick': 'Republican' }
讀取文件
In [3]:
table = pd.read_csv('data/usa_election.txt')table.head()
C:\jupyter\lib\site-packages\IPython\core\interactiveshell.py:2785: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
Out[3]:
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e 08 RETIRED RETIRED 250.0 20-JUN-11 NaN NaN NaN SA17A 736166
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e 08 RETIRED RETIRED 50.0 23-JUN-11 NaN NaN NaN SA17A 736166
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e 08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11 NaN NaN NaN SA17A 749073
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e 08 NONE RETIRED 250.0 01-AUG-11 NaN NaN NaN SA17A 749073
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e 08 NONE RETIRED 300.0 20-JUN-11 NaN NaN NaN SA17A 736166
In [8]:
#使用map函數(shù) 字典,新建一列各個(gè)候選人所在黨派partytable['party'] = table['cand_nm'].map(parties)table.head()
Out[8]:
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e 08 RETIRED RETIRED 250.0 20-JUN-11 NaN NaN NaN SA17A 736166 Republican
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e 08 RETIRED RETIRED 50.0 23-JUN-11 NaN NaN NaN SA17A 736166 Republican
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e 08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 05-JUL-11 NaN NaN NaN SA17A 749073 Republican
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e 08 NONE RETIRED 250.0 01-AUG-11 NaN NaN NaN SA17A 749073 Republican
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e 08 NONE RETIRED 300.0 20-JUN-11 NaN NaN NaN SA17A 736166 Republican
In [10]:
#party這一列中有哪些元素table['party'].unique()
Out[10]:
array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)
In [ ]:
#使用value_counts()函數(shù),統(tǒng)計(jì)party列中各個(gè)元素出現(xiàn)次數(shù),value_counts()是Series中的,無參,返回一個(gè)帶有每個(gè)元素出現(xiàn)次數(shù)的Series
In [11]:
table['party'].value_counts()
Out[11]:
Democrat 292400Republican 237575Reform 5364Libertarian 702Name: party, dtype: int64
In [12]:
#使用groupby()函數(shù),查看各個(gè)黨派收到的政治獻(xiàn)金總數(shù)contb_receipt_amttable.groupby(by='party')['contb_receipt_amt'].sum()
Out[12]:
partyDemocrat 8.105758e 07Libertarian 4.132769e 05Reform 3.390338e 05Republican 1.192255e 08Name: contb_receipt_amt, dtype: float64
In [13]:
#查看具體每天各個(gè)黨派收到的政治獻(xiàn)金總數(shù)contb_receipt_amt 。使用groupby([多個(gè)分組參數(shù)])table.groupby(by=['party','contb_receipt_dt'])['contb_receipt_amt'].sum()
Out[13]:
party contb_receipt_dtDemocrat 01-AUG-11 175281.00 01-DEC-11 651532.82 01-JAN-12 58098.80 01-JUL-11 165961.00 01-JUN-11 145459.00 01-MAY-11 82644.00 01-NOV-11 122529.87 01-OCT-11 148977.00 01-SEP-11 403297.62 02-AUG-11 164510.11 02-DEC-11 216056.96 02-JAN-12 89743.60 02-JUL-11 17105.00 02-JUN-11 422453.00 02-MAY-11 396675.00 02-NOV-11 147183.81 02-OCT-11 62605.62 02-SEP-11 137948.41 03-AUG-11 147053.02 03-DEC-11 81304.02 03-JAN-12 87406.97 03-JUL-11 5982.00 03-JUN-11 320176.20 03-MAY-11 261819.11 03-NOV-11 119304.56 03-OCT-11 363061.02 03-SEP-11 45598.00 04-APR-11 640235.12 04-AUG-11 598784.23 04-DEC-11 72795.10 ... Republican 29-AUG-11 941769.23 29-DEC-11 428501.42 29-JAN-11 750.00 29-JAN-12 75220.02 29-JUL-11 233423.35 29-JUN-11 1340704.29 29-MAR-11 38875.00 29-MAY-11 8363.20 29-NOV-11 407322.64 29-OCT-11 81924.01 29-SEP-11 1612794.52 30-APR-11 43004.80 30-AUG-11 915548.58 30-DEC-11 492470.45 30-JAN-12 255204.80 30-JUL-11 12249.04 30-JUN-11 2744932.63 30-MAR-11 50240.00 30-MAY-11 17803.60 30-NOV-11 809014.83 30-OCT-11 43913.16 30-SEP-11 4886331.76 31-AUG-11 1017735.02 31-DEC-11 1094376.72 31-JAN-11 6000.00 31-JAN-12 869890.41 31-JUL-11 12781.02 31-MAR-11 62475.00 31-MAY-11 301339.80 31-OCT-11 734601.83Name: contb_receipt_amt, Length: 1183, dtype: float64
In [14]:
def trasform_date(d): day,month,year = d.split('-') month = months[month] return "20" year '-' str(month) '-' day
In [17]:
#將表中日期格式轉(zhuǎn)換為'yyyy-mm-dd'。日期格式,通過函數(shù)加map方式進(jìn)行轉(zhuǎn)換table['contb_receipt_dt'] = table['contb_receipt_dt'].apply(trasform_date)
In [18]:
table.head()
Out[18]:
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
0 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e 08 RETIRED RETIRED 250.0 2011-6-20 NaN NaN NaN SA17A 736166 Republican
1 C00410118 P20002978 Bachmann, Michelle HARVEY, WILLIAM MOBILE AL 3.6601e 08 RETIRED RETIRED 50.0 2011-6-23 NaN NaN NaN SA17A 736166 Republican
2 C00410118 P20002978 Bachmann, Michelle SMITH, LANIER LANETT AL 3.68633e 08 INFORMATION REQUESTED INFORMATION REQUESTED 250.0 2011-7-05 NaN NaN NaN SA17A 749073 Republican
3 C00410118 P20002978 Bachmann, Michelle BLEVINS, DARONDA PIGGOTT AR 7.24548e 08 NONE RETIRED 250.0 2011-8-01 NaN NaN NaN SA17A 749073 Republican
4 C00410118 P20002978 Bachmann, Michelle WARDENBURG, HAROLD HOT SPRINGS NATION AR 7.19016e 08 NONE RETIRED 300.0 2011-6-20 NaN NaN NaN SA17A 736166 Republican
In [19]:
#查看老兵(捐獻(xiàn)者職業(yè))DISABLED VETERAN主要支持誰 :查看老兵們捐贈(zèng)給誰的錢最多table['contbr_occupation'] == 'DISABLED VETERAN'
Out[19]:
0 False1 False2 False3 False4 False5 False6 False7 False8 False9 False10 False11 False12 False13 False14 False15 False16 False17 False18 False19 False20 False21 False22 False23 False24 False25 False26 False27 False28 False29 False ... 536011 False536012 False536013 False536014 False536015 False536016 False536017 False536018 False536019 False536020 False536021 False536022 False536023 False536024 False536025 False536026 False536027 False536028 False536029 False536030 False536031 False536032 False536033 False536034 False536035 False536036 False536037 False536038 False536039 False536040 FalseName: contbr_occupation, Length: 536041, dtype: bool
In [21]:
old_bing_df = table.loc[table['contbr_occupation'] == 'DISABLED VETERAN']
In [22]:
old_bing_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()
Out[22]:
cand_nmCain, Herman 300.00Obama, Barack 4205.00Paul, Ron 2425.49Santorum, Rick 250.00Name: contb_receipt_amt, dtype: float64
In [23]:
table['contb_receipt_amt'].max()
Out[23]:
1944042.43
In [24]:
#找出候選人的捐贈(zèng)者中,捐贈(zèng)金額最大的人的職業(yè)以及捐獻(xiàn)額 .通過query("查詢條件來查找捐獻(xiàn)人職業(yè)")table.query('contb_receipt_amt == 1944042.43')
Out[24]:
cmte_id cand_id cand_nm contbr_nm contbr_city contbr_st contbr_zip contbr_employer contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num party
176127 C00431445 P80003338 Obama, Barack OBAMA VICTORY FUND 2012 - UNITEMIZED CHICAGO IL 60680 NaN NaN 1944042.43 2011-12-31 NaN X * SA18 763233 Democrat
來源: