An Introduction to Hive Partition and Bucket Operations

This article walks through partition and bucket operations in Hive. The methods shown are simple and practical, and each one comes with a command you can try.

Partition operations

Hive partitioning is enabled by specifying PARTITIONED BY when a table is created. The dimension used for partitioning is not a column stored in the data itself; the partition value is supplied when data is inserted. To query a particular partition, filter on the partition column in a WHERE clause, for example "WHERE tablename.partition_key > a". The syntax for creating a partitioned table is as follows.

CREATE TABLE table_name(...) PARTITIONED BY (dt STRING, country STRING);

1. Creating partitions

Hive has none of the more elaborate partition types (range, list, hash, composite, and so on). A partition column is also not an actual field of the table but a pseudo-column, and a table may have one or more of them. In other words, the table's data files do not store the partition column values.

Create a simple partitioned table.

hive> create table partition_test(member_id string,name string) partitioned by (stat_date string,province string) row format delimited fields terminated by ',';

This example defines stat_date and province as the partition columns. Typically a partition is created in advance and then used. For example:

hive> alter table partition_test add partition (stat_date='2015-01-18',province='beijing');

This creates one partition. Hive also creates a corresponding directory in HDFS.

$ hadoop fs -ls /user/hive/warehouse/partition_test/stat_date=2015-01-18
/user/hive/warehouse/partition_test/stat_date=2015-01-18/province=beijing    (the partition just created)

Every partition has its own directory. In the example above, stat_date is the top level and province is the second level.
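As a minimal sketch of how this layout is used, the statements below list the partitions of the partition_test table created above and then read a single partition. Because the WHERE clause names both partition columns, Hive only needs to read the matching directory rather than the whole table. Nothing here goes beyond the table and values already used in this article.

```sql
-- List the partitions Hive has registered for the table.
SHOW PARTITIONS partition_test;

-- Filter on the partition columns so that only the
-- stat_date=2015-01-18/province=beijing directory is scanned.
SELECT member_id, name
FROM partition_test
WHERE stat_date = '2015-01-18'
  AND province = 'beijing';
```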

2. Inserting data

A helper table without partitions, partition_test_input, holds the data that will be inserted into partition_test. The steps are as follows.

1) Inspect the structure of partition_test_input:

hive> desc partition_test_input;

2) Inspect the data in partition_test_input:

hive> select * from partition_test_input;

3) Insert data into one partition of partition_test:

hive> insert overwrite table partition_test partition(stat_date='2015-01-18',province='jiangsu') select member_id,name from partition_test_input where stat_date='2015-01-18' and province='jiangsu';
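The steps above do not show how partition_test_input is defined. Judging from the columns the INSERT statements read, it might look something like the sketch below. The column types, the field delimiter, and the file path are all assumptions made for illustration, not details from the original article.

```sql
-- Hypothetical definition of the helper table; the column list is inferred
-- from the INSERT statements in this article, and the types are assumed.
CREATE TABLE partition_test_input (
  member_id STRING,
  name      STRING,
  stat_date STRING,
  province  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load a local comma-separated file (the path is a placeholder).
LOAD DATA LOCAL INPATH '/tmp/partition_test_input.csv'
INTO TABLE partition_test_input;
```

Here stat_date and province are ordinary columns, which is why the INSERT statements can filter on them and why they can later feed the dynamic partition columns.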

To insert into several partitions in one statement, use a multi-insert. The source table is named once in the leading FROM clause, so the individual SELECT clauses do not repeat it:

hive> from partition_test_input
    > insert overwrite table partition_test partition(stat_date='2015-01-18',province='jiangsu')
    >   select member_id,name where stat_date='2015-01-18' and province='jiangsu'
    > insert overwrite table partition_test partition(stat_date='2015-01-28',province='sichuan')
    >   select member_id,name where stat_date='2015-01-28' and province='sichuan'
    > insert overwrite table partition_test partition(stat_date='2015-01-28',province='beijing')
    >   select member_id,name where stat_date='2015-01-28' and province='beijing';

3. Dynamic partitions

With the approach above, a large data source needs one INSERT per partition, which quickly becomes tedious. Dynamic partitioning solves this: Hive routes each row to the right partition based on the values returned by the query.

Dynamic partitioning is turned on with the following settings:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

Using dynamic partitioning is straightforward. Suppose data is to be inserted under the stat_date='2015-01-18' partition, and Hive should decide which province sub-partition each row belongs to. stat_date is then the static partition column and province is the dynamic partition column. The dynamic column must appear last in the SELECT list.

hive> insert overwrite table partition_test partition(stat_date='2015-01-18',province) select member_id,name,province from partition_test_input where stat_date='2015-01-18';

Note that a dynamic partition column may not come before a static one. The main partition cannot be dynamic while the sub-partition is static, because every main partition would then have to create the sub-partition defined by the static column.

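The following parameters limit what a dynamic-partition insert may create.

hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions that each mapper or reducer node may create; exceeding it raises an error (default 100).

hive.exec.max.dynamic.partitions: the maximum total number of dynamic partitions that one DML statement may create (default 1000).

hive.exec.max.created.files: the maximum number of files that all mappers and reducers of a MapReduce job may create (default 100000).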
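If an insert is expected to exceed these limits, they can be raised for the session before the statement runs. The snippet below is only a sketch: the property names are the ones listed above, but the numeric values are arbitrary illustrations and should be sized to the actual job.

```sql
-- Enable dynamic partitioning for this session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Illustrative limits only; choose values that fit the data.
SET hive.exec.max.dynamic.partitions.pernode=500;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.created.files=200000;
```

Raising the limits allows the job to run, but each extra partition still means extra directories and files, so it is usually worth checking whether the partition columns are too fine-grained first.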

Try to route rows with the same partition column values to the same MapReduce task, so that each task creates as few new directories as possible. DISTRIBUTE BY groups rows with equal partition column values together:

hive> insert overwrite table partition_test partition(stat_date,province) select member_id,name,stat_date,province from partition_test_input distribute by stat_date,province;

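Bucket operations

A Hive table can be divided into partitions, and a table or partition can be divided further into buckets (BUCKET). Bucketing is declared with CLUSTERED BY in the table definition, and the rows inside each bucket can be ordered with SORTED BY.

Buckets serve two main purposes.

1) Data sampling;

2) Speeding up certain queries, such as map-side joins.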
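To make the second purpose concrete, here is a hedged sketch of a bucketed map-side join. It uses the student table defined later in this article and a second table, student_score, which is purely hypothetical: it is assumed to have id and score columns and to be CLUSTERED BY (id) with a bucket count that is a multiple or factor of student's. Depending on the Hive version and settings, the MAPJOIN hint may be ignored and the conversion decided automatically.

```sql
-- Ask Hive to exploit matching buckets when converting to a map-side join.
SET hive.optimize.bucketmapjoin = true;

-- student_score is a hypothetical table bucketed on id, like student.
SELECT /*+ MAPJOIN(sc) */ s.id, s.name, sc.score
FROM student s
JOIN student_score sc ON s.id = sc.id;
```

Because both tables are bucketed on the join key, each mapper only needs to load the bucket of student_score that corresponds to the bucket of student it is processing, instead of the whole table.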

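Note in particular that CLUSTERED BY and SORTED BY do not affect how data is loaded. The user is responsible for loading the data correctly, including bucketing and sorting it. With 'set hive.enforce.bucketing=true', Hive automatically sets the number of reducers in the final stage to match the number of buckets. Alternatively, the user can set mapred.reduce.tasks by hand to match the bucket count. The first option is recommended: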

hive> set hive.enforce.bucketing=true;

An example follows.

1) Create a temporary table student_tmp and load data into it.

hive> desc student_tmp;
hive> select * from student_tmp;

2) Create the student table.

hive> create table student(id int,age int,name string) partitioned by (stat_date string) clustered by (id) sorted by (age) into 2 buckets row format delimited fields terminated by ',';

3) Set the configuration property.

hive> set hive.enforce.bucketing=true;

4) Insert data.

hive> from student_tmp insert overwrite table student partition(stat_date='2015-01-19') select id,age,name where stat_date='2015-01-18' sort by age;

5) Inspect the file directory.

$ hadoop fs -ls /user/hive/warehouse/student/stat_date=2015-01-19/

6) Query the sampled data.

hive> select * from student tablesample(bucket 1 out of 2 on id);

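tablesample is the sampling clause. Its syntax is as follows.

tablesample(bucket x out of y [on colname])

x is the number of the bucket to start from, and y must be a multiple or a factor of the table's total number of buckets.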
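The effect of y can be sketched against the student table from the example, which has 2 buckets on id. The queries below are illustrations under that assumption; the comments describe the expected sampling fraction rather than any actual output.

```sql
-- y equals the bucket count (2): read exactly one bucket, here the second.
SELECT * FROM student TABLESAMPLE(BUCKET 2 OUT OF 2 ON id);

-- y = 4 is a multiple of 2: read roughly half of one bucket.
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);

-- y = 1 is a factor of 2: every bucket qualifies, so all rows are returned.
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 1 ON id);
```

In short, the sample covers about 1/y of the data, and sampling on the clustering column lets Hive pick whole buckets instead of scanning the full table.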
