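Introduction:
If a table holds a lot of data, queries against it are slow and take a long time. What if we only need part of the data? This is where partitioning comes in: a query that filters on the partition column only has to read the matching partitions.
Hive partitioned tables come in two kinds: static partitions and dynamic partitions.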
PARTITIONED BY
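This clause creates a partitioned table. A table can have one or more partitions, and each partition exists as a separate folder under the table's directory. Single-level partitioned table demo:
# Create a single-level partitioned table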
hive> create table order_partition(
> ordernumber string,
> eventtime string
> )
> partitioned by (event_month string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.82 seconds
# Load the data in order.txt into the order_partition table
hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_partition partition (event_month='2014-05');
Loading data to table default.order_partition partition (event_month=2014-05)
Partition default.order_partition{event_month=2014-05} stats: [numFiles=1, numRows=0, totalSize=208, rawDataSize=0]
OK
Time taken: 1.749 seconds
# Query the data in the order_partition partition
hive> select * from order_partition where event_month='2014-05';
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Time taken: 0.208 seconds, Fetched: 5 row(s)
# Check the metadata stored in MySQL
mysql> select * from partitions;
+---------+-------------+------------------+---------------------+-------+--------+
| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+------------------+---------------------+-------+--------+
| 1 | 1530498328 | 0 | event_month=2014-05 | 32 | 31 |
+---------+-------------+------------------+---------------------+-------+--------+
1 row in set (0.00 sec)
mysql> select * from partition_key_vals;
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
| 1 | 2014-05 | 0 |
+---------+--------------+-------------+
1 row in set (0.00 sec)
# Check the directory in HDFS
[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-07-02 10:29 /user/hive/warehouse/order_partition/event_month=2014-05
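The benefit of this directory layout can be illustrated with a toy model in Python. This is only a sketch of the idea, not Hive's implementation: the table is a plain dictionary from partition directory to rows, the `scan` helper is a hypothetical name, and the second directory (`event_month=2014-06`) is included purely to show that it is skipped.

```python
# Toy model: a partitioned table is a mapping from partition directory to rows.
# Directory names mirror the transcript; event_month=2014-06 is illustrative only.
table = {
    "event_month=2014-05": [("10703007267488", "2014-05-01 06:01:12.334+01")],
    "event_month=2014-06": [("10101043505096", "2014-05-01 07:28:12.342+01")],
}


def scan(table, column, value):
    """Return (rows, directories_read) for an equality filter on the partition column."""
    wanted = f"{column}={value}"
    rows, dirs_read = [], []
    for directory, data in table.items():
        if directory != wanted:
            continue  # pruned: this directory is never opened
        dirs_read.append(directory)
        # The partition value is appended as a trailing column, as in the query output.
        rows.extend(record + (value,) for record in data)
    return rows, dirs_read


rows, dirs_read = scan(table, "event_month", "2014-05")
print(dirs_read)
print(rows)
```

Only one directory is read for the filter, which is the reason a query restricted to one partition avoids scanning the whole table.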
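Note: data can also be loaded with the hadoop shell, as demonstrated below.
Creating a partition means that a partition directory appears under the table's directory in HDFS. So can we simply create a directory on HDFS ourselves and put the data into it?
# Create the directory and upload the file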
[hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-06
[hadoop@hadoop000 ~]$ hadoop fs -put /home/hadoop/order.txt /user/hive/warehouse/order_partition/event_month=2014-06
[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2018-07-02 10:29 /user/hive/warehouse/order_partition/event_month=2014-05
drwxr-xr-x - hadoop supergroup 0 2018-07-02 10:54 /user/hive/warehouse/order_partition/event_month=2014-06
# The partition returns no data
hive> select * from order_partition where event_month='2014-06';
OK
Time taken: 0.21 seconds
# The file was uploaded to HDFS, so HDFS has the data, but the Hive metastore has no record of the partition yet; run the following command to update it
hive> msck repair table order_partition;
OK
Partitions not in metastore: order_partition:event_month=2014-06
Repair: Added partition to metastore order_partition:event_month=2014-06
Time taken: 0.178 seconds, Fetched: 2 row(s)
# Query the partition data again
hive> select * from order_partition where event_month='2014-06';
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-06
10101043505096 2014-05-01 07:28:12.342+01 2014-06
10103043509747 2014-05-01 07:50:12.33+01 2014-06
10103043501575 2014-05-01 09:27:12.33+01 2014-06
10104043514061 2014-05-01 09:03:12.324+01 2014-06
Time taken: 0.257 seconds, Fetched: 5 row(s)
# List the table's partitions
hive> show partitions order_partition;
OK
event_month=2014-05
event_month=2014-06
Time taken: 0.164 seconds, Fetched: 2 row(s)
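What the repair step does can be sketched as a set difference between the directories on the file system and the partitions the metastore knows about. The Python below is a simplified model under that assumption; `msck_repair` is a hypothetical helper name, and the real command also has to list the table directory, which is omitted here.

```python
def msck_repair(fs_dirs, metastore):
    """Register every partition directory that the metastore does not know about.

    fs_dirs: partition directory names found under the table directory.
    metastore: set of partition names already registered (updated in place).
    Returns the partitions that were added, in sorted order.
    """
    missing = sorted(set(fs_dirs) - metastore)
    metastore.update(missing)
    return missing


fs_dirs = ["event_month=2014-05", "event_month=2014-06"]
metastore = {"event_month=2014-05"}

added = msck_repair(fs_dirs, metastore)
print(added)
print(sorted(metastore))
```

Running the function a second time adds nothing, because the two sides already agree.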
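Note: when msck repair table runs, Hive looks for partitions that exist as directories under the table's HDFS path but have no metadata in the metastore, and adds them to the metastore. On a table that has been accumulating partitions for several years, the command can run for a very long time without returning, so it is too heavy-handed and not recommended in production. Use ADD PARTITION to add the partition instead.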
[hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-07
[hadoop@hadoop000 ~]$ hadoop fs -put /home/hadoop/order.txt /user/hive/warehouse/order_partition/event_month=2014-07
[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2018-07-02 10:29 /user/hive/warehouse/order_partition/event_month=2014-05
drwxr-xr-x - hadoop supergroup 0 2018-07-02 10:54 /user/hive/warehouse/order_partition/event_month=2014-06
drwxr-xr-x - hadoop supergroup 0 2018-07-02 11:09 /user/hive/warehouse/order_partition/event_month=2014-07
# Query the new partition
hive> select * from order_partition where event_month='2014-07';
OK
Time taken: 0.188 seconds
# Add the partition
hive> ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION (event_month='2014-07');
OK
Time taken: 0.22 seconds
# Query again
hive> select * from order_partition where event_month='2014-07';
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-07
10101043505096 2014-05-01 07:28:12.342+01 2014-07
10103043509747 2014-05-01 07:50:12.33+01 2014-07
10103043501575 2014-05-01 09:27:12.33+01 2014-07
10104043514061 2014-05-01 09:03:12.324+01 2014-07
Time taken: 0.206 seconds, Fetched: 5 row(s)
hive> show partitions order_partition;
OK
event_month=2014-05
event_month=2014-06
event_month=2014-07
Time taken: 0.151 seconds, Fetched: 3 row(s)
Multi-level partitioned table demo:
# Create a multi-level partitioned table
hive> create table order_mulit_partition(
> ordernumber string,
> eventtime string
> )
> partitioned by (event_month string,event_day string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.133 seconds
# Load data
hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_mulit_partition partition (event_month='2014-05',event_day='01');
# Query the partition
hive> select * from order_mulit_partition where event_month='2014-05' and event_day='01';
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05 01
10101043505096 2014-05-01 07:28:12.342+01 2014-05 01
10103043509747 2014-05-01 07:50:12.33+01 2014-05 01
10103043501575 2014-05-01 09:27:12.33+01 2014-05 01
10104043514061 2014-05-01 09:03:12.324+01 2014-05 01
hive> show partitions order_mulit_partition;
OK
event_month=2014-05/event_day=01
Time taken: 0.158 seconds, Fetched: 1 row(s)
# Directory structure of the multi-level partition in HDFS
[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/order_mulit_partition/event_month=2014-05
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-07-02 11:17 /user/hive/warehouse/order_mulit_partition/event_month=2014-05/event_day=01
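Summary: a single-level partition maps to a single directory level on HDFS; a multi-level partition maps to nested directory levels, one per partition column.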
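The mapping from partition columns to directory levels can be written down in a few lines of Python. This is a minimal sketch: `partition_path` and `parse_partition` are hypothetical helpers, and the escaping Hive applies to special characters in partition values is deliberately ignored.

```python
def partition_path(table_dir, spec):
    """Join key=value segments in the order the partition columns are declared."""
    segments = [f"{key}={value}" for key, value in spec]
    return "/".join([table_dir.rstrip("/")] + segments)


def parse_partition(name):
    """Split a name such as 'a=1/b=2' back into ordered (key, value) pairs."""
    return [tuple(segment.split("=", 1)) for segment in name.split("/")]


single = partition_path(
    "/user/hive/warehouse/order_partition",
    [("event_month", "2014-05")],
)
multi = partition_path(
    "/user/hive/warehouse/order_mulit_partition",
    [("event_month", "2014-05"), ("event_day", "01")],
)
print(single)
print(multi)
print(parse_partition("event_month=2014-05/event_day=01"))
```

The partition names printed by show partitions have the same key=value/key=value shape that `parse_partition` takes apart.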
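Reference: the official documentation, which describes dynamic partition (DP) columns and static partition (SP) columns as follows: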
DP columns are specified the same way as it is for SP columns – in the partition clause. The only difference is that DP columns do not have values, while SP columns do. In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns.
In INSERT ... SELECT ... queries, the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.
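A brief summary of the differences:
1. DP columns are specified the same way as SP columns, in the partition clause (after the PARTITION keyword). The only difference is that DP columns have no value while SP columns do (a DP column appears after PARTITION as a key with no value).
2. In INSERT ... SELECT ... queries, the dynamic partition columns must come last among the columns of the SELECT statement, in the same order in which they appear in the PARTITION() clause.
3. Making every partition column a DP column is only allowed in nonstrict mode; in strict mode Hive throws an error.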
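Rules 1 and 3 can be expressed as a small check. The Python below is only an illustration of the rules as stated above, not Hive's validation code; `check_partition_clause` is a hypothetical name, and a value of `None` stands for a column that has a key but no value.

```python
def check_partition_clause(spec, mode="strict"):
    """Return the dynamic columns of a partition clause, or raise in strict mode.

    spec: ordered list of (column, value); value None marks a dynamic (DP) column.
    mode: 'strict' or 'nonstrict', mirroring hive.exec.dynamic.partition.mode.
    """
    dynamic = [col for col, value in spec if value is None]
    all_dynamic = bool(dynamic) and len(dynamic) == len(spec)
    if all_dynamic and mode == "strict":
        raise ValueError("strict mode requires at least one static partition column")
    return dynamic


# partition(deptno) in nonstrict mode: allowed, deptno is dynamic
print(check_partition_clause([("deptno", None)], mode="nonstrict"))

# partition(ds='2010-03-03', age): allowed even in strict mode
print(check_partition_clause([("ds", "2010-03-03"), ("age", None)], mode="strict"))

# partition(deptno) in strict mode: rejected
try:
    check_partition_clause([("deptno", None)], mode="strict")
except ValueError as err:
    print("rejected:", err)
```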
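A few examples follow:
Note: to show the difference between dynamic and static partitions, and how tedious static partitions are by comparison, we work with static partitions first and then demonstrate dynamic partitions.
# Create the static-partition employee table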
hive> CREATE TABLE emp_static_partition (
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> salary double,
> comm double
> )
> PARTITIONED BY (deptno int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.198 seconds
# The data in the emp table that will be inserted into the static partitions
hive> select * from emp;
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.164 seconds, Fetched: 14 row(s)
# Every partition needs its own insert statement
hive> insert into table emp_static_partition partition(deptno=10)
> select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=10;
Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d
Total jobs = 3
Launching Job 1 out of 3
...
OK
Time taken: 15.265 seconds
hive> insert into table emp_static_partition partition(deptno=20)
> select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=20;
Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d
Total jobs = 3
Launching Job 1 out of 3
...
OK
Time taken: 18.527 seconds
hive> insert into table emp_static_partition partition(deptno=30)
> select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=30;
Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d
Total jobs = 3
Launching Job 1 out of 3
...
OK
Time taken: 14.062 seconds
# Query each partition
hive> select * from emp_static_partition where deptno='10';
OK
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.219 seconds, Fetched: 3 row(s)
hive> select * from emp_static_partition where deptno='20';
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
Time taken: 0.197 seconds, Fetched: 5 row(s)
hive> select * from emp_static_partition where deptno='30';
OK
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 0.181 seconds, Fetched: 6 row(s)
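Static partitioned tables have one very serious drawback: every partition needs its own insert statement.
The same load is demonstrated below with dynamic partitions.
Some settings are needed before the demo. To use dynamic partitions with every partition column dynamic, set the parameters below. They can be set temporarily for the session, or written to the configuration file (hive-site.xml) to take effect permanently. The session-level settings are:
set hive.exec.dynamic.partition=true; -- enable dynamic partitioning (false by default in early Hive releases, true by default since Hive 0.9.0)
set hive.exec.dynamic.partition.mode=nonstrict; -- dynamic partition mode; the default is strict, which requires at least one static partition column; nonstrict allows every partition column to be dynamic
# Create the dynamic-partition employee table, partitioned on deptno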
hive> CREATE TABLE emp_dynamic_partition (
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> salary double,
> comm double
> )
> PARTITIONED BY (deptno int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.165 seconds
# A single insert statement does the whole job
hive> insert into table emp_dynamic_partition partition(deptno)
> select empno,ename ,job ,mgr ,hiredate ,salary ,comm, deptno from emp;
Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d
Total jobs = 3
Launching Job 1 out of 3
...
OK
Time taken: 17.982 seconds
# Query each partition
hive> show partitions emp_dynamic_partition;
OK
deptno=10
deptno=20
deptno=30
Time taken: 0.176 seconds, Fetched: 3 row(s)
hive> select * from emp_dynamic_partition where deptno='10';
OK
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 2.662 seconds, Fetched: 3 row(s)
hive> select * from emp_dynamic_partition where deptno='20';
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
Time taken: 0.178 seconds, Fetched: 5 row(s)
hive> select * from emp_dynamic_partition where deptno='30';
OK
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 0.146 seconds, Fetched: 6 row(s)
[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse
Found 6 items
drwxr-xr-x - hadoop supergroup 0 2018-06-24 15:38 /user/hive/warehouse/emp
drwxr-xr-x - hadoop supergroup 0 2018-07-02 13:55 /user/hive/warehouse/emp_dynamic_partition
drwxr-xr-x - hadoop supergroup 0 2018-07-02 13:50 /user/hive/warehouse/emp_static_partition
drwxr-xr-x - hadoop supergroup 0 2018-07-02 11:17 /user/hive/warehouse/order_mulit_partition
drwxr-xr-x - hadoop supergroup 0 2018-07-02 11:09 /user/hive/warehouse/order_partition
drwxr-xr-x - hadoop supergroup 0 2018-06-24 15:35 /user/hive/warehouse/stu
[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/emp_static_partition
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2018-07-02 13:47 /user/hive/warehouse/emp_static_partition/deptno=10
drwxr-xr-x - hadoop supergroup 0 2018-07-02 13:50 /user/hive/warehouse/emp_static_partition/deptno=20
drwxr-xr-x - hadoop supergroup 0 2018-07-02 13:51 /user/hive/warehouse/emp_static_partition/deptno=30
[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/emp_dynamic_partition
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2018-07-02 13:55 /user/hive/warehouse/emp_dynamic_partition/deptno=10
drwxr-xr-x - hadoop supergroup 0 2018-07-02 13:55 /user/hive/warehouse/emp_dynamic_partition/deptno=20
drwxr-xr-x - hadoop supergroup 0 2018-07-02 13:55 /user/hive/warehouse/emp_dynamic_partition/deptno=30
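The single dynamic insert above can be pictured as grouping the selected rows by their trailing column. The Python below is a toy model under that assumption, with a shortened, hypothetical subset of the emp rows (empno, ename, deptno); `dynamic_insert` is not a Hive function.

```python
from collections import defaultdict


def dynamic_insert(rows, num_partition_cols=1):
    """Group rows by their trailing partition column(s).

    The partition values are stripped from the stored row, because they live
    in the directory name rather than in the data file.
    """
    partitions = defaultdict(list)
    for row in rows:
        data, keys = row[:-num_partition_cols], row[-num_partition_cols:]
        partitions[keys].append(data)
    return dict(partitions)


# Shortened rows: (empno, ename, deptno), with deptno last as rule 2 requires.
emp = [
    (7369, "SMITH", 20),
    (7499, "ALLEN", 30),
    (7782, "CLARK", 10),
    (7839, "KING", 10),
]

result = dynamic_insert(emp)
for (deptno,), members in sorted(result.items()):
    print(f"deptno={deptno}", [name for _, name in members])
```

One pass over the input produces every partition, which is what replaces the three separate static inserts.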
Supplement: the two kinds of partition can also be mixed. A brief look follows:
hive> create table student(
> id int,
> name string,
> tel string,
> age int
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 0.125 seconds
hive> insert into student values(1,'zhangsan','18311111111',20),(2,'lisi','18222222222',30),(3,'wangwu','15733333333',40);
Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d
Total jobs = 3
Launching Job 1 out of 3
...
OK
Time taken: 15.375 seconds
hive> select * from student;
OK
1 zhangsan 18311111111 20
2 lisi 18222222222 30
3 wangwu 15733333333 40
Time taken: 0.106 seconds, Fetched: 3 row(s)
# Create the mixed-partition table
hive> create table stu_mixed_partition(
> id int,
> name string,
> tel string
> )
> partitioned by (ds string,age int)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.171 seconds
# Insert data
hive> insert into stu_mixed_partition partition(ds='2010-03-03',age)
> select id,name,tel,age from student;
Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d
Total jobs = 3
Launching Job 1 out of 3
...
OK
Time taken: 18.887 seconds
# List the partitions
hive> show partitions stu_mixed_partition;
OK
ds=2010-03-03/age=20
ds=2010-03-03/age=30
ds=2010-03-03/age=40
hive> select * from stu_mixed_partition where ds='2010-03-03' and age=20;
OK
1 zhangsan 18311111111 2010-03-03 20
Time taken: 0.184 seconds, Fetched: 1 row(s)
hive> select * from stu_mixed_partition where ds='2010-03-03' and age=30;
OK
2 lisi 18222222222 2010-03-03 30
Time taken: 0.188 seconds, Fetched: 1 row(s)
hive> select * from stu_mixed_partition where ds='2010-03-03' and age=40;
OK
3 wangwu 15733333333 2010-03-03 40
Time taken: 0.186 seconds, Fetched: 1 row(s)
# Check the HDFS directory
[hadoop@oradb3 ~]$ hadoop fs -ls /user/hive/warehouse/stu_mixed_partition/ds=2010-03-03
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2018-07-02 14:10 /user/hive/warehouse/stu_mixed_partition/ds=2010-03-03/age=20
drwxr-xr-x - hadoop supergroup 0 2018-07-02 14:10 /user/hive/warehouse/stu_mixed_partition/ds=2010-03-03/age=30
drwxr-xr-x - hadoop supergroup 0 2018-07-02 14:10 /user/hive/warehouse/stu_mixed_partition/ds=2010-03-03/age=40
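The mixed case combines a fixed prefix with a per-row suffix. As a rough Python sketch (the helper `mixed_insert` is hypothetical and only models how the partition names come about), the static value supplies the first directory level and the trailing column of each row supplies the second:

```python
def mixed_insert(rows, static_spec, dynamic_cols):
    """Build partition names from static values plus trailing dynamic values.

    static_spec: ordered (column, value) pairs fixed in the PARTITION clause.
    dynamic_cols: names of the dynamic columns, matching the trailing row fields.
    """
    prefix = [f"{col}={value}" for col, value in static_spec]
    n = len(dynamic_cols)
    out = {}
    for row in rows:
        data, keys = row[:-n], row[-n:]
        suffix = [f"{col}={value}" for col, value in zip(dynamic_cols, keys)]
        out.setdefault("/".join(prefix + suffix), []).append(data)
    return out


student = [
    (1, "zhangsan", "18311111111", 20),
    (2, "lisi", "18222222222", 30),
    (3, "wangwu", "15733333333", 40),
]

parts = mixed_insert(student, [("ds", "2010-03-03")], ["age"])
for name in sorted(parts):
    print(name, parts[name])
```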