Hive Basic Operations

01. What Is Hive

1. Introduction to Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query capabilities.
Hive is a SQL parsing engine: it translates SQL statements into MapReduce jobs and runs them on Hadoop.
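
To make the translation step more concrete, here is a minimal Python sketch of how a query such as SELECT name, COUNT(*) FROM t GROUP BY name could run as map, shuffle, and reduce phases. It only illustrates the MapReduce model and is not Hive's actual planner; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit (group_key, 1) for every input row, like a mapper for COUNT(*)."""
    for row in rows:
        yield row["name"], 1

def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, producing one output row per group."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input rows for: SELECT name, COUNT(*) FROM t GROUP BY name
rows = [{"name": "a"}, {"name": "b"}, {"name": "a"}]
result = reduce_phase(shuffle(map_phase(rows)))
print(result)
```

Each phase maps onto one stage of the generated job: the mapper reads rows, the shuffle groups by the GROUP BY key, and the reducer computes the aggregate.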

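2. Hive Architecture

User interfaces, including the CLI, JDBC/ODBC, and the Web UI
Metadata storage, usually a relational database such as MySQL or Derby
Interpreter, compiler, optimizer, and executor
Hadoop: HDFS for storage, MapReduce for computation

Note: Hive's metadata is not stored on HDFS; it lives in a relational database (the metastore), most commonly MySQL or the embedded Derby. The metadata includes table names, each table's columns and partitions with their properties, table properties (for example, whether the table is external), and the directory that holds the table's data.

Metadata is data that describes data; the data itself is stored in Hadoop HDFS.
The data is still the original text files, but they are now organized under a planned directory layout.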

3. How Hive Relates to Hadoop

Hive stores data in HDFS and queries it with MapReduce.

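4. Installing and Deploying Hive

Hive is only a tool and does not need a cluster configuration of its own.

export HIVE_HOME=/usr/local/hive-2.0.1
export PATH=$PATH:$HIVE_HOME/bin
Configure MySQL as the metastore. Without this configuration Hive falls back to the Derby database, which is inconvenient: a metastore_db directory is created in whatever directory you run ./hive from.
MySQL only needs to be installed on one of the nodes.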

5. Hive's Thrift Service

Hive can be installed on one node and published as a standard service; other nodes then connect to it with beeline.

To start it (assuming it runs on master):
Start as a foreground service: bin/hiveserver2
Start in the background: nohup bin/hiveserver2 1>/var/log/hiveserver.log 2>/var/log/hiveserver.err &

To connect:
Run hive/bin/beeline and press Enter to open the beeline prompt
Enter one of the following commands to connect to hiveserver2
beeline> !connect jdbc:hive2://master:10000
beeline> !connect jdbc:hive2://localhost:10000
(master is the host name of the machine where hiveserver2 was started; the default port is 10000)

02. Basic Hive Operations

1. Creating a Database

hive > create database tabletest;

Creating a new database produces a tabletest.db folder under /user/hive/warehouse/ in HDFS.
If you do not create a database and do not run hive> use <database name>, the built-in default database is used. You can also select it explicitly with hive> use default; its tables are created directly under /user/hive/warehouse/.

2. Creating a Table

Syntax:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[STORED AS file_format] 
[LOCATION hdfs_path]

Example:

create table t_order(id int,name string,rongliang string,price double)
row format delimited fields terminated by '\t';

Creating the t_order table adds the table's information to the TBLS table in the MySQL metastore, along with its column information.

At the same time, a t_order folder is added under the tabletest.db folder in HDFS. All data of a table (external tables excluded) is stored in this directory.
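
The directory convention described above can be sketched as a small Python helper. This is only an illustration of the naming rule stated in this article (warehouse root, then `<database>.db`, then the table name); the function name `table_location` is hypothetical and the warehouse root is assumed to be the default `/user/hive/warehouse`.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # default warehouse root used in this article

def table_location(table, database="default", warehouse=WAREHOUSE):
    """Return the HDFS directory a managed table is expected to live in."""
    if database == "default":
        # Tables in the default database sit directly under the warehouse root.
        return posixpath.join(warehouse, table)
    return posixpath.join(warehouse, database + ".db", table)

print(table_location("t_order", "tabletest"))
print(table_location("t_order"))
```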

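3. Loading Data

You can upload files straight into the t_order folder with HDFS commands, or use Hive's load command.

load data local inpath '/home/hadoop/ip.txt' [OVERWRITE] into table tab_ext;

With local, this has the same effect as uploading a local Linux file to HDFS. Be careful, though: if the path after inpath is an HDFS path (no local keyword), the file is moved into the target folder, so it disappears from its original location.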
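
The copy-versus-move difference can be simulated on an ordinary file system. The sketch below is an analogy only: temporary directories stand in for the local disk and HDFS, and the helper `load_data` is hypothetical, not a Hive API.

```python
import os
import shutil
import tempfile

def load_data(src_path, table_dir, local):
    """Mimic LOAD DATA: copy when LOCAL is given, move otherwise."""
    os.makedirs(table_dir, exist_ok=True)
    if local:
        shutil.copy(src_path, table_dir)   # source file stays where it was
    else:
        shutil.move(src_path, table_dir)   # source file leaves its old path

root = tempfile.mkdtemp()
table_dir = os.path.join(root, "warehouse", "tab_ext")

# Case 1: "local" source, which is copied.
local_file = os.path.join(root, "ip.txt")
with open(local_file, "w") as f:
    f.write("1\ta\n")
load_data(local_file, table_dir, local=True)
local_still_exists = os.path.exists(local_file)

# Case 2: source that stands in for an HDFS path, which is moved.
hdfs_file = os.path.join(root, "ip_hdfs.txt")
with open(hdfs_file, "w") as f:
    f.write("2\tb\n")
load_data(hdfs_file, table_dir, local=False)
hdfs_still_exists = os.path.exists(hdfs_file)

loaded_files = sorted(os.listdir(table_dir))
shutil.rmtree(root)  # clean up the temporary directories
print(local_still_exists, hdfs_still_exists, loaded_files)
```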

4. External Tables

The EXTERNAL keyword creates an external table and lets you specify, at creation time, a path (LOCATION) that points to the actual data. For a managed (internal) table, Hive moves the data into the path of the data warehouse; for an external table, it only records where the data is and does not move it.
To avoid losing the source files, create an external table; the data can then live in any location.

CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
     ip STRING,
     country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/external/hive';

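Because the HDFS path is specified when the table is created, you only need to upload the source files to the /external/hive/ folder.

When an external table is dropped, only its metadata is deleted; the data stored in HDFS is left untouched.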
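
The different drop behaviour of managed and external tables can be modelled with two dictionaries. This is a toy model, assuming a metastore that maps table names to properties and a storage map from directories to files; `drop_table` and the file names are hypothetical.

```python
def drop_table(metastore, storage, name):
    """Remove a table from a toy metastore; delete its files only if it is managed."""
    table = metastore.pop(name)
    if not table["external"]:
        storage.pop(table["location"], None)

# Toy metastore (table name -> properties) and storage (directory -> files).
metastore = {
    "t_order": {"external": False,
                "location": "/user/hive/warehouse/tabletest.db/t_order"},
    "tab_ip_ext": {"external": True, "location": "/external/hive"},
}
storage = {
    "/user/hive/warehouse/tabletest.db/t_order": ["order.txt"],
    "/external/hive": ["ip.txt"],
}

drop_table(metastore, storage, "t_order")     # metadata and data both go
drop_table(metastore, storage, "tab_ip_ext")  # only the metadata goes
print(metastore, storage)
```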

5. Partitioned Tables

Purpose: when the files are large, a partitioned table lets queries quickly filter the data by the partition column.
Suppose t_order has two partitions, part1 and part2.
In practice this just means two more folders under the t_order folder in HDFS, one per partition (part_flag=part1 and part_flag=part2).

create table t_order(id int,name string,rongliang string,price double)
partitioned by (part_flag string)
row format delimited fields terminated by '\t';

Loading data:

load data local inpath '/home/hadoop/ip.txt' overwrite into table t_order partition(part_flag='part1');

This uploads ip.txt into the /t_order/part_flag=part1/ folder.

Querying data in a partition:

select * from t_order  where part_flag='part1';

The partition column appears in query results, but it is not stored in the table's data files; it is a pseudo-column.

A Hive partition is simply one more directory level. Benefits: it makes statistics easier and queries more efficient, because the data set to scan is smaller.
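
The following Python sketch illustrates the two points above: a partition filter lets the reader skip whole directories, and the partition column's value comes from the directory name rather than from the file contents. The in-memory "file system" and its rows are hypothetical.

```python
# Toy file system: partition directory -> raw lines stored in it.
# Directory names follow the key=value layout; file contents are made up.
table_files = {
    "t_order/part_flag=part1": ["1\tphone\t64G\t1999.0"],
    "t_order/part_flag=part2": ["2\ttablet\t128G\t2999.0"],
}

def scan(table_files, part_flag=None):
    """Read rows, skipping directories that do not match the partition filter."""
    rows, dirs_read = [], []
    for directory, lines in sorted(table_files.items()):
        value = directory.split("part_flag=")[1]
        if part_flag is not None and value != part_flag:
            continue  # partition pruning: this directory is never opened
        dirs_read.append(directory)
        for line in lines:
            # The partition value comes from the directory name, not the file.
            rows.append(line.split("\t") + [value])
    return rows, dirs_read

rows, dirs_read = scan(table_files, part_flag="part1")
print(rows, dirs_read)
```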

Related commands:

SHOW TABLES; -- list all tables
SHOW TABLES 'TMP*'; -- wildcard matching is supported
SHOW PARTITIONS TMP_TABLE; -- list the partitions of a table
DESCRIBE TMP_TABLE; -- show the table structure

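6. Bucketed Tables

Bucketing in Hive corresponds to partitioning in MapReduce, whereas partitioning in Hive merely places data in different folders.

1. Creating a Bucketed Table

create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string)
clustered by(Sno)
sorted by(Sno DESC)
into 4 buckets
row format delimited
fields terminated by ',';

Meaning: rows are bucketed by the Sno column into 4 buckets, and each bucket is locally sorted by Sno. Creating the table does not process any data; it only requires that the data inserted later is already bucketed.

2. Inserting Data into a Bucketed Table

load is generally not used to put data into a bucketed table, because loaded data is not automatically split into four files according to the bucketing rule. Instead, data is usually inserted with a select query.

Set the variables first: enable bucketing enforcement and set the number of reducers to the number of buckets.

set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table stu_buck
select * from student cluster by(Sno);
insert overwrite table stu_buck
select * from student distribute by(Sno) sort by(Sno asc);

distribute by(Sno) sort by(Sno asc) can replace cluster by(Sno); the two are equivalent. cluster by(Sno) = bucketing + sorting:
rows are distributed first and then sorted locally. The difference is that distribute by is more flexible, since it can distribute on one column and sort on another.
The select part outputs 4 files, which become the input of the insert.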

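3. How Bucketing Works and What It Is For

How it works:
Hive organizes buckets on a specific column. It hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into. (The principle is the same as MapReduce's getPartition method.)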
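
The hash-then-modulo rule can be sketched in a few lines of Python. The sketch assumes that the hash of an integer key is the integer itself (as with Java's Integer.hashCode); other column types and newer Hive versions may hash differently, and the Sno values are hypothetical.

```python
def bucket_for(key, num_buckets):
    """Return the bucket index for an integer key: hash, then modulo.

    Assumption: the hash of an integer is the integer itself; the mask
    keeps the value non-negative before taking the remainder.
    """
    return (key & 0x7FFFFFFF) % num_buckets

# Distribute hypothetical Sno values over the 4 buckets of stu_buck.
buckets = {i: [] for i in range(4)}
for sno in [95001, 95002, 95003, 95004, 95005, 95006]:
    buckets[bucket_for(sno, 4)].append(sno)
print(buckets)
```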

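Benefits:
(1) The biggest benefit is more efficient join operations.
The precondition is that both tables are bucketed and that their bucket counts are equal or multiples of each other.

Consider this query:
select a.id,a.name,b.addr from a join b on a.id = b.id;
If tables a and b are both bucketed, and the bucketing column is id,
does the join still need a Cartesian product over the full tables?
When two tables are joined on a shared column and both are bucketed on that column, only the buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data the JOIN has to process.

(2) Sampling becomes more efficient. When working with large data sets, being able to trial-run a query on a small part of the data while developing and revising it is very convenient.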
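
For point (1), a small Python sketch shows why bucketing helps a join: when both tables are bucketed on the join column with the same bucket count, only buckets with the same index need to be compared. The rows, the `addr` values, and the helper names are hypothetical, and the nested loop is a deliberately naive stand-in for the real join algorithm.

```python
def bucketize(rows, num_buckets):
    """Split (id, value) rows into buckets by id modulo num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[0] % num_buckets].append(row)
    return buckets

def bucket_join(a_rows, b_rows, num_buckets):
    """Inner join on id, comparing only buckets with the same index."""
    a_buckets = bucketize(a_rows, num_buckets)
    b_buckets = bucketize(b_rows, num_buckets)
    joined, comparisons = [], 0
    for a_bucket, b_bucket in zip(a_buckets, b_buckets):
        for a_id, a_name in a_bucket:
            for b_id, b_addr in b_bucket:
                comparisons += 1
                if a_id == b_id:
                    joined.append((a_id, a_name, b_addr))
    return sorted(joined), comparisons

# Hypothetical rows for tables a(id, name) and b(id, addr).
a_rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
b_rows = [(2, "x"), (3, "y"), (5, "z"), (8, "w")]
joined, comparisons = bucket_join(a_rows, b_rows, 4)
print(joined, comparisons)
```

With four rows on each side, a full Cartesian comparison would need 16 row pairs, while the bucket-wise version compares far fewer.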

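7. The insert Statement

Inserting rows into Hive one at a time is too slow.
Batch inserts work well, though; in practice they just add files to the table's folder.

create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like
select * from tab_ip;

This writes the query result as files into tab_ip_like. With overwrite, existing data is replaced; use insert into to append instead.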

8. Ways to Save select Query Results

1. Save the result to a new Hive table
create table t_tmp
as
select * from t_p;

2. Save the result to an existing Hive table
insert into table t_tmp
select * from t_p;

3. Save the result to a directory (local or on HDFS)
insert overwrite local directory '/home/hadoop/test'
select * from t_p;

Writing to HDFS:
insert overwrite directory '/aaa/test'
select * from t_p;

9. Querying and Dropping

Syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference
[WHERE where_condition] 
[GROUP BY col_list [HAVING condition]] 
[CLUSTER BY col_list 
| [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list] 
] 
[LIMIT number]

Notes:

CLUSTER BY: distributes rows by the given column; remember to set the number of reducers accordingly.
order by sorts the whole input globally, so it uses a single reducer, which leads to long run times when the input is large.
sort by is not a global sort; it sorts the data before it enters each reducer. So if you use sort by with mapred.reduce.tasks>1, each reducer's output is sorted but the overall output is not.
distribute by(column) sends rows to different reducers based on the given column, using hashing.
Cluster by(column) does what Distribute by does and additionally sorts by that column.

So when the distribution column and the sort column are the same, cluster by = distribute by + sort by.
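
The relationship between these clauses can be simulated in Python. The sketch below models reducers as plain lists and uses the key modulo the reducer count as a stand-in for the hash; all function names and the sample values are hypothetical.

```python
def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer chosen by hashing the key (here: key modulo n)."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers

def sort_by(reducers, key):
    """Sort inside each reducer only; no ordering across reducers."""
    return [sorted(part, key=key) for part in reducers]

def cluster_by(rows, key, num_reducers):
    """CLUSTER BY x behaves like DISTRIBUTE BY x followed by SORT BY x."""
    return sort_by(distribute_by(rows, key, num_reducers), key)

def order_by(rows, key):
    """Global sort: everything goes through a single reducer."""
    return sorted(rows, key=key)

ids = [7, 2, 9, 4, 1, 6]  # hypothetical integer column values
identity = lambda x: x
clustered = cluster_by(ids, identity, 2)
concatenated = [x for part in clustered for x in part]
print(clustered, order_by(ids, identity))
```

Each reducer's list is sorted, but concatenating the reducers' outputs does not give a globally sorted result; only order by guarantees that.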

select * from inner_table;
select count(*) from inner_table;

Dropping a managed table deletes both its metadata and its data:

drop table inner_table;

10. Joins in Hive

LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN, inner join, left semi join

Sample data (a.txt, then b.txt):
1,a
2,b
3,c
4,d
7,y
8,u

2,bb
3,cc
7,yy
9,pp

Create the tables:

create table a(id int,name string)
row format delimited fields terminated by ',';
create table b(id int,name string)
row format delimited fields terminated by ',';

Load the data:

load data local inpath '/home/hadoop/a.txt' into table a;
load data local inpath '/home/hadoop/b.txt' into table b;

1. inner join

select * from a inner join b on a.id=b.id;

+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 7 | y | 7 | yy |
+-------+---------+-------+---------+--+
This is the intersection of the two tables.

2. left join

select * from a left join b on a.id=b.id;

+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 1 | a | NULL | NULL |
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 4 | d | NULL | NULL |
| 7 | y | 7 | yy |
| 8 | u | NULL | NULL |
+-------+---------+-------+---------+--+
Rows from the left table that have no match get NULL in the right-hand columns.

3. right join

select * from a right join b on a.id=b.id;

4. full outer join

select * from a full outer join b on a.id=b.id;

+-------+---------+-------+---------+--+
| a.id | a.name | b.id | b.name |
+-------+---------+-------+---------+--+
| 1 | a | NULL | NULL |
| 2 | b | 2 | bb |
| 3 | c | 3 | cc |
| 4 | d | NULL | NULL |
| 7 | y | 7 | yy |
| 8 | u | NULL | NULL |
| NULL | NULL | 9 | pp |
+-------+---------+-------+---------+--+
Rows from both sides are shown.

5. left semi join

select * from a left semi join b on a.id = b.id;

+-------+---------+--+
| a.id | a.name |
+-------+---------+--+
| 2 | b |
| 3 | c |
| 7 | y |
+-------+---------+--+
Only the left half, that is, a's columns, is returned; this is somewhat more efficient.
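
The five join types can be reproduced on the same sample data with a few lines of Python, which also shows what the right join returns. This is a naive sketch of the semantics only, not of how Hive executes joins; the function names are hypothetical and None stands in for NULL.

```python
a = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (7, "y"), (8, "u")]
b = [(2, "bb"), (3, "cc"), (7, "yy"), (9, "pp")]

def inner_join(left, right):
    """Keep only row pairs whose ids match."""
    return [l + r for l in left for r in right if l[0] == r[0]]

def left_join(left, right):
    """Keep every left row; unmatched rows get NULLs on the right."""
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out.extend(l + r for r in matches)
        else:
            out.append(l + (None, None))  # no match: right side is NULL
    return out

def right_join(left, right):
    """A left join with the roles swapped, then columns put back in order."""
    return [row[2:] + row[:2] for row in left_join(right, left)]

def full_outer_join(left, right):
    """Left join plus the right rows that found no partner."""
    left_ids = {l[0] for l in left}
    unmatched_right = [(None, None) + r for r in right if r[0] not in left_ids]
    return left_join(left, right) + unmatched_right

def left_semi_join(left, right):
    """Return only left rows that have at least one match on the right."""
    right_ids = {r[0] for r in right}
    return [l for l in left if l[0] in right_ids]

for row in right_join(a, b):
    print(row)
```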

11. Creating a Temporary Table

Useful for storing intermediate results.
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;

03. Hive User-Defined Functions and Transform

When Hive's built-in functions cannot meet your processing needs, consider a user-defined function (UDF).

1. Types of User-Defined Functions

UDF: operates on a single row and produces a single row as output (math functions, string functions).
UDAF (user-defined aggregate function): takes multiple input rows and produces one output row (count, max).
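
The difference between the two categories is easy to see in a plain Python analogy: a UDF-style function is applied to each value on its own, while a UDAF-style function consumes a whole column and returns one value. The function names and sample values are hypothetical and are not part of any Hive API.

```python
def to_lower_case(value):
    """UDF-style function: one input value in, one output value out."""
    return None if value is None else value.lower()

def count_non_null(values):
    """UDAF-style function: many input rows in, one aggregated value out."""
    return sum(1 for v in values if v is not None)

names = ["Alice", "BOB", None]  # hypothetical column values
lowered = [to_lower_case(n) for n in names]   # same number of rows as the input
total = count_non_null(names)                 # a single value for the whole column
print(lowered, total)
```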

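2. UDF Development Example

1. Write a Java class that extends UDF and provides an evaluate method

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {
    // evaluate must be public so that Hive can call it
    public String evaluate(String field) {
        if (field == null) {
            return null;
        }
        return field.toLowerCase();
    }
}

2. Package it as a jar and upload it

5. The custom function can then be used in HQL

select id,tolowercase(name) from t_p;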

3. Implementing Transform

Hive's TRANSFORM keyword lets you call your own scripts from SQL.
It suits cases where Hive lacks a feature and you would rather not write a UDF.

1. Load the rating.json file into a raw Hive table, rat_json

-- {"movie":"1721","rate":"3","timeStamp":"965440048","uid":"5114"}
create table rat_json(line string) row format delimited;
load data local inpath '/opt/rating.json' into table rat_json;

2. Parse the JSON into four fields and insert them into a new table, t_rating (it holds the processed data and needs four columns)

create table t_rating(movieid string,rate int,timestring string,uid string)
row format delimited fields terminated by '\t';
insert overwrite table t_rating
select get_json_object(line,'$.movie') as movieid,
       get_json_object(line,'$.rate') as rate,
       get_json_object(line,'$.timeStamp') as timestring,
       get_json_object(line,'$.uid') as uid
from rat_json;
-- get_json_object is a built-in JSON function; a custom UDF could do the same job
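
As a rough Python equivalent of this extraction step, the sketch below parses one line of rating.json and produces the four tab-separated fields that t_rating expects. It uses the sample record shown above; the function name `json_line_to_row` is hypothetical.

```python
import json

def json_line_to_row(line):
    """Parse one JSON record and return the four fields as a tab-separated row."""
    record = json.loads(line)
    fields = [record["movie"], record["rate"], record["timeStamp"], record["uid"]]
    return "\t".join(fields)

sample = '{"movie":"1721","rate":"3","timeStamp":"965440048","uid":"5114"}'
row = json_line_to_row(sample)
print(row)
```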

3. Use transform with Python to convert the unix time into a weekday

(1) First write a Python script file
Python code:

vi weekday_mapper.py
#!/bin/python
import sys
import datetime

# Read tab-separated rows from stdin and replace the unix time with the ISO weekday
for line in sys.stdin:
  line = line.strip()
  movieid, rating, unixtime,userid = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([movieid, rating, str(weekday),userid]))

(2) Add the file to Hive so that the tasks can use it:

hive>add FILE /home/hadoop/weekday_mapper.py;
hive>create TABLE u_data_new as
SELECT
  TRANSFORM (movieid, rate, timestring,uid)
  USING 'python weekday_mapper.py'
  AS (movieid, rate, weekday,uid)
FROM t_rating;
select distinct(weekday) from u_data_new limit 10;

The processed query output is stored in the new u_data_new table.
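
Before wiring a script into TRANSFORM, it can help to check the per-line logic on its own. The sketch below wraps the same conversion in a function, `map_line` (a hypothetical name), and pins the time zone to UTC so the result does not depend on the machine; the script above uses the local time zone, so its output may differ near day boundaries.

```python
import datetime

def map_line(line):
    """Convert one tab-separated row, replacing the unix time with an ISO weekday."""
    movieid, rating, unixtime, userid = line.strip().split("\t")
    # Assumption: interpret the timestamp in UTC for a reproducible result.
    moment = datetime.datetime.fromtimestamp(float(unixtime), tz=datetime.timezone.utc)
    return "\t".join([movieid, rating, str(moment.isoweekday()), userid])

# 965440048 is 2000-08-05 01:47:28 UTC, a Saturday (ISO weekday 6).
mapped = map_line("1721\t3\t965440048\t5114\n")
print(mapped)
```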

Source: http://weahome.cn/article/ihscdd.html
