這期內(nèi)容當(dāng)中小編將會(huì)給大家?guī)?lái)有關(guān)Data Lake Analytics + OSS數(shù)據(jù)文件的格式處理分別是怎樣的,文章內(nèi)容豐富且以專業(yè)的角度為大家分析和敘述,閱讀完這篇文章希望大家可以有所收獲。
創(chuàng)新互聯(lián)為您提適合企業(yè)的網(wǎng)站設(shè)計(jì)?讓您的網(wǎng)站在搜索引擎具有高度排名,讓您的網(wǎng)站具備超強(qiáng)的網(wǎng)絡(luò)競(jìng)爭(zhēng)力!結(jié)合企業(yè)自身,進(jìn)行網(wǎng)站設(shè)計(jì)及把握,最后結(jié)合企業(yè)文化和具體宗旨等,才能創(chuàng)作出一份性化解決方案。從網(wǎng)站策劃到網(wǎng)站設(shè)計(jì)制作、成都做網(wǎng)站, 我們的網(wǎng)頁(yè)設(shè)計(jì)師為您提供的解決方案。
Data Lake Analytics是Serverless化的云上交互式查詢分析服務(wù)。用戶可以使用標(biāo)準(zhǔn)的SQL語(yǔ)句,對(duì)存儲(chǔ)在OSS、TableStore上的數(shù)據(jù)無(wú)需移動(dòng),直接進(jìn)行查詢分析。
除了純文本文件(例如,CSV,TSV等),用戶存儲(chǔ)在OSS上的其他格式的數(shù)據(jù)文件,也可以使用Data Lake Analytics進(jìn)行查詢分析,包括ORC, PARQUET, JSON, RCFILE, AVRO甚至ESRI規(guī)范的地理JSON數(shù)據(jù),還可以用正則表達(dá)式匹配的文件等。
下面詳細(xì)介紹如何根據(jù)存儲(chǔ)在OSS上的文件格式使用Data Lake Analytics (下文簡(jiǎn)稱 DLA)進(jìn)行分析。DLA內(nèi)置了各種處理文件數(shù)據(jù)的SerDe(Serialize/Deserilize的簡(jiǎn)稱,目的是用于序列化和反序列化)實(shí)現(xiàn),用戶無(wú)需自己編寫(xiě)程序,基本上能選用DLA中的一款或多款SerDe來(lái)匹配您OSS上的數(shù)據(jù)文件格式。
用戶可以依據(jù)存儲(chǔ)在OSS上的數(shù)據(jù)文件進(jìn)行建表,通過(guò)STORED AS 指定數(shù)據(jù)文件的格式。
例如,
CREATE EXTERNAL TABLE nation ( N_NATIONKEY INT, N_NAME STRING, N_REGIONKEY INT, N_COMMENT STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';
建表成功后可以使用SHOW CREATE TABLE語(yǔ)句查看原始建表語(yǔ)句。
MySQL> show create table nation; +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Result | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | CREATE EXTERNAL TABLE `nation`( `n_nationkey` int, `n_name` string, `n_regionkey` int, `n_comment` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS `TEXTFILE` LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation'| +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 1 row in set (1.81 sec)
下表中列出了目前DLA已經(jīng)支持的文件格式,當(dāng)針對(duì)下列格式的文件建表時(shí),可以直接使用STORED AS,DLA會(huì)選擇合適的SERDE/INPUTFORMAT/OUTPUTFORMAT。
存儲(chǔ)格式 | 描述 |
STORED AS TEXTFILE | 數(shù)據(jù)文件的存儲(chǔ)格式為純文本文件。默認(rèn)的文件類型。 文件中的每一行對(duì)應(yīng)表中的一條記錄。 |
STORED AS ORC | 數(shù)據(jù)文件的存儲(chǔ)格式為ORC。 |
STORED AS PARQUET | 數(shù)據(jù)文件的存儲(chǔ)格式為PARQUET。 |
STORED AS RCFILE | 數(shù)據(jù)文件的存儲(chǔ)格式為RCFILE。 |
STORED AS AVRO | 數(shù)據(jù)文件的存儲(chǔ)格式為AVRO。 |
STORED AS JSON | 數(shù)據(jù)文件的存儲(chǔ)格式為JSON (Esri ArcGIS的地理JSON數(shù)據(jù)文件 除外)。 |
在指定了STORED AS 的同時(shí),還可以根據(jù)具體文件的特點(diǎn),指定SerDe (用于解析數(shù)據(jù)文件并映射到DLA表),特殊的列分隔符等。
后面的部分會(huì)做進(jìn)一步的講解。
CSV文件,本質(zhì)上還是純文本文件,可以使用STORED AS TEXTFILE。
列與列之間以逗號(hào)分隔,可以通過(guò)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 表示。
例如,數(shù)據(jù)文件oss://bucket-for-testing/oss/text/cities/city.csv的內(nèi)容為
Beijing,China,010 ShangHai,China,021 Tianjin,China,022
建表語(yǔ)句可以為
CREATE EXTERNAL TABLE city ( city STRING, country STRING, code INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/oss/text/cities';
OpenCSVSerde在使用時(shí)需要注意以下幾點(diǎn):
用戶可以為行的字段指定字段分隔符、字段內(nèi)容引用符號(hào)和轉(zhuǎn)義字符,例如:WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\" );
不支持字段內(nèi)嵌入的行分割符;
所有字段定義STRING類型;
其他數(shù)據(jù)類型的處理,可以在SQL中使用函數(shù)進(jìn)行轉(zhuǎn)換。
例如,
CREATE EXTERNAL TABLE test_csv_opencsvserde ( id STRING, name STRING, location STRING, create_date STRING, create_timestamp STRING, longitude STRING, latitude STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties( 'separatorChar'=',', 'quoteChar'='"', 'escapeChar'='\\' ) STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
需要自定義列分隔符(FIELDS TERMINATED BY),轉(zhuǎn)義字符(ESCAPED BY),行結(jié)束符(LINES TERMINATED BY)。
需要在建表語(yǔ)句中指定
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n'
在csv文件中,有時(shí)會(huì)帶有HEADER信息,需要在數(shù)據(jù)讀取時(shí)忽略掉這些內(nèi)容。這時(shí)需要在建表語(yǔ)句中定義skip.header.line.count。
例如,數(shù)據(jù)文件oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl的內(nèi)容如下:
N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT 0|ALGERIA|0| haggle. carefully final deposits detect slyly agai| 1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon| 2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special | 3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold| 4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d| 5|ETHIOPIA|0|ven packages wake quickly. regu|
相應(yīng)的建表語(yǔ)句為:
CREATE EXTERNAL TABLE nation_header ( N_NATIONKEY INT, N_NAME STRING, N_REGIONKEY INT, N_COMMENT STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl' TBLPROPERTIES ("skip.header.line.count"="1");
skip.header.line.count的取值x和數(shù)據(jù)文件的實(shí)際行數(shù)n有如下關(guān)系:
當(dāng)x<=0時(shí),DLA在讀取文件時(shí),不會(huì)過(guò)濾掉任何信息,即全部讀取;
當(dāng)0
當(dāng)x>=n時(shí),DLA在讀取文件時(shí),會(huì)過(guò)濾掉所有的文件內(nèi)容。
與CSV文件類似,TSV格式的文件也是純文本文件,列與列之間的分隔符為T(mén)ab。
例如,數(shù)據(jù)文件oss://bucket-for-testing/oss/text/cities/city.tsv的內(nèi)容為
Beijing China 010 ShangHai China 021 Tianjin China 022
建表語(yǔ)句可以為
CREATE EXTERNAL TABLE city ( city STRING, country STRING, code INT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/oss/text/cities';
假設(shè)您的數(shù)據(jù)字段的分隔符包含多個(gè)字符,可采用如下示例建表語(yǔ)句,其中每行的數(shù)據(jù)字段分割符為“||”,可以替換為您具體的分割符字符串。
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' with serdeproperties( "field.delim"="||" )
示例:
CREATE EXTERNAL TABLE test_csv_multidelimit ( id STRING, name STRING, location STRING, create_date STRING, create_timestamp STRING, longitude STRING, latitude STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' with serdeproperties( "field.delim"="||" ) STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/oss/text/cities/';
DLA可以處理的JSON文件通常以純文本的格式存儲(chǔ),數(shù)據(jù)文件的編碼方式需要是UTF-8。
在JSON文件中,每行必須是一個(gè)完整的JSON對(duì)象。
例如,下面的文件格式是不被接受的
{"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"} {"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"} {"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"} {"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
需要改寫(xiě)成:
{"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"} {"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"} {"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"} {"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
建表語(yǔ)句可以寫(xiě)
CREATE EXTERNAL TABLE t1 (id int, name string, c3 timestamp) STORED AS JSON LOCATION 'oss://path/to/t1/directory';
使用struct和array結(jié)構(gòu)定義嵌套的JSON數(shù)據(jù)。
例如,用戶原始數(shù)據(jù)(注意:無(wú)論是否嵌套,一條完整的JSON數(shù)據(jù)都只能放在一行上,才能被Data Lake Analytics處理):
{ "DocId": "Alibaba", "User_1": { "Id": 1234, "Username": "bob1234", "Name": "Bob", "ShippingAddress": { "Address1": "969 Wenyi West St.", "Address2": null, "City": "Hangzhou", "Province": "Zhejiang" }, "Orders": [{ "ItemId": 6789, "OrderDate": "11/11/2017" }, { "ItemId": 4352, "OrderDate": "12/12/2017" } ] } }
使用在線JSON格式化工具格式化后,數(shù)據(jù)內(nèi)容如下:
{ "DocId": "Alibaba", "User_1": { "Id": 1234, "Username": "bob1234", "Name": "Bob", "ShippingAddress": { "Address1": "969 Wenyi West St.", "Address2": null, "City": "Hangzhou", "Province": "Zhejiang" }, "Orders": [ { "ItemId": 6789, "OrderDate": "11/11/2017" }, { "ItemId": 4352, "OrderDate": "12/12/2017" } ] } }
則建表語(yǔ)句可以寫(xiě)成如下(注意:LOCATION中指定的路徑必須是JSON數(shù)據(jù)文件所在的目錄,該目錄下的所有JSON文件都能被識(shí)別為該表的數(shù)據(jù)):
CREATE EXTERNAL TABLE json_table_1 ( docid string, user_1 struct< id:INT, username:string, name:string, shippingaddress:struct< address1:string, address2:string, city:string, province:string >, orders:array< struct< itemid:INT, orderdate:string > > > ) STORED AS JSON LOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/';
對(duì)該表進(jìn)行查詢:
select * from json_table_1; +---------+----------------------------------------------------------------------------------------------------------------+ | docid | user_1 | +---------+----------------------------------------------------------------------------------------------------------------+ | Alibaba | [1234, bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352, 12/12/2017]]] | +---------+----------------------------------------------------------------------------------------------------------------+
對(duì)于struct定義的嵌套結(jié)構(gòu),可以通過(guò)“.”進(jìn)行層次對(duì)象引用,對(duì)于array定義的數(shù)組結(jié)構(gòu),可以通過(guò)“[數(shù)組下標(biāo)]”(注意:數(shù)組下標(biāo)從1開(kāi)始)進(jìn)行對(duì)象引用。
select DocId, User_1.Id, User_1.ShippingAddress.Address1, User_1.Orders[1].ItemId from json_table_1 where User_1.Username = 'bob1234' and User_1.Orders[2].OrderDate = '12/12/2017'; +---------+------+--------------------+-------+ | DocId | id | address1 | _col3 | +---------+------+--------------------+-------+ | Alibaba | 1234 | 969 Wenyi West St. | 6789 | +---------+------+--------------------+-------+
例如,把“value_string”的嵌套JSON值作為字符串存儲(chǔ):
{"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"張三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}
使用在線JSON格式化工具格式化后,數(shù)據(jù)內(nèi)容如下:
{ "data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com", "ts": 1524550275112, "value_string": "{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"張三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}" }
建表語(yǔ)句為
CREATE external TABLE json_table_2 ( data_key string, ts bigint, value_string string ) STORED AS JSON LOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/';
表建好后,可進(jìn)行查詢:
select * from json_table_2; +---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | data_key | ts | value_string | +---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {"appName":"","apps":[],"checksum":"50fa0540b430904ee78dff07c7350e1c","clusterMap":{"DEFAULT":{"defCkport":80,"defIPPort":80,"healthCheckTask":null,"healthChecker":{"checkCode":200,"curlHost":"","curlPath":"/status.taobao","type":"HTTP"},"name":"DEFAULT","nodegroup":"","sitegroup":"","submask":"0.0.0.0/0","syncConfig":{"appName":"trade-ma","nodegroup":"tradema","pubLevel":"publish","role":"","site":""},"useIPPort4Check":true}},"disabledSites":[],"enableArmoryUnit":false,"enableClientBeat":false,"enableHealthCheck":true,"enabled":true,"envAndSites":"","invalidThreshold":0.6,"ipDeleteTimeout":1800000,"lastModifiedMillis":1524550275107,"localSiteCall":true,"localSiteThreshold":0.8,"name":"biz.alibaba.com","nodegroup":"","owners":["junlan.zx","張三","李四","cui.yuanc"],"protectThreshold":0,"requireSameEnv":false,"resetWeight":false,"symmetricCallType":null,"symmetricType":"warehouse","tagName":"ipGroup","tenantId":"","tenants":[],"token":"1cf0ec0c771321bb4177182757a67fb0","useSpecifiedURL":false} | +---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
下面SQL示例json_parse,json_extract_scalar,json_extract等常用JSON函數(shù)的使用方式:
mysql> select json_extract_scalar(json_parse(value), '$.owners[1]') from json_table_2; +--------+ | _col0 | +--------+ | 張三 | +--------+ mysql> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask') from ( select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2 ) json_obj where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao'; +-----------+ | _col0 | +-----------+ | 0.0.0.0/0 | +-----------+ mysql> with json_obj as (select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2) select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask') from json_obj where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao'; +-----------+ | _col0 | +-----------+ | 0.0.0.0/0 | +-----------+
Optimized Row Columnar(ORC)是Apache開(kāi)源項(xiàng)目Hive支持的一種優(yōu)化的列存儲(chǔ)文件格式。與CSV文件相比,不僅可以節(jié)省存儲(chǔ)空間,還可以得到更好的查詢性能。
對(duì)于ORC文件,只需要在建表時(shí)指定 STORED AS ORC。
例如,
CREATE EXTERNAL TABLE orders_orc_date ( O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE DATE, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING ) STORED AS ORC LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';
Parquet是Apache開(kāi)源項(xiàng)目Hadoop支持的一種列存儲(chǔ)的文件格式。
使用DLA建表時(shí),需要指定STORED AS PARQUET即可。
例如,
CREATE EXTERNAL TABLE orders_parquet_date ( O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE DATE, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING ) STORED AS PARQUET LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';
Record Columnar File (RCFile), 列存儲(chǔ)文件,可以有效地將關(guān)系型表結(jié)構(gòu)存儲(chǔ)在分布式系統(tǒng)中,并且可以被高效地讀取和處理。
DLA在建表時(shí),需要指定STORED AS RCFILE。
例如,
CREATE EXTERNAL TABLE lineitem_rcfile_date ( L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE DATE, L_COMMITDATE DATE, L_RECEIPTDATE DATE, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING ) STORED AS RCFILE LOCATION 'oss://bucke-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile'
DLA針對(duì)AVRO文件建表時(shí),需要指定STORED AS AVRO,并且定義的字段需要符合AVRO文件的schema。
如果不確定可以通過(guò)使用Avro提供的工具,獲得schema,并根據(jù)schema建表。
在Apache Avro官網(wǎng)下載avro-tools-.jar到本地,執(zhí)行下面的命令獲得Avro文件的schema:
java -jar avro-tools-1.8.2.jar getschema /path/to/your/doctors.avro { "type" : "record", "name" : "doctors", "namespace" : "testing.hive.avro.serde", "fields" : [ { "name" : "number", "type" : "int", "doc" : "Order of playing the role" }, { "name" : "first_name", "type" : "string", "doc" : "first name of actor playing role" }, { "name" : "last_name", "type" : "string", "doc" : "last name of actor playing role" } ] }
建表語(yǔ)句如下,其中fields中的name對(duì)應(yīng)表中的列名,type需要參考本文檔中的表格轉(zhuǎn)成DLA支持的類型
CREATE EXTERNAL TABLE doctors( number int, first_name string, last_name string) STORED AS AVRO LOCATION 'oss://mybucket-for-testing/directory/to/doctors';
大多數(shù)情況下,Avro的類型可以直接轉(zhuǎn)換成DLA中對(duì)應(yīng)的類型。如果該類型在DLA不支持,則會(huì)轉(zhuǎn)換成接近的類型。具體請(qǐng)參照下表:
Avro類型 | 對(duì)應(yīng)DLA類型 |
---|---|
null | void |
boolean | boolean |
int | int |
long | bigint |
float | float |
double | double |
bytes | binary |
string | string |
record | struct |
map | map |
list | array |
union | union |
enum | string |
fixed | binary |
通常此類型的文件是以純文本格式存儲(chǔ)在OSS上的,每一行代表表中的一條記錄,并且每行可以用正則表達(dá)式匹配。
例如,Apache WebServer日志文件就是這種類型的文件。
某日志文件的內(nèi)容為:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"
每行文件可以用下面的正則表達(dá)式表示,列之間使用空格分隔:
([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?
針對(duì)上面的文件格式,建表語(yǔ)句可以表示為:
CREATE EXTERNAL TABLE serde_regex( host STRING, identity STRING, userName STRING, time STRING, request STRING, status STRING, size INT, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?" ) STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/datasets/serde/regex';
查詢結(jié)果
mysql> select * from serde_regex; +-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+ | host | identity | userName | time | request | status | size | referer | agent | +-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+ | 127.0.0.1 | - | frank | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0" | 200 | 2326 | NULL | NULL | | 127.0.0.1 | - | - | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200 | 5864 | - | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19" | +-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
DLA支持Esri ArcGIS的地理JSON數(shù)據(jù)文件的SerDe處理,關(guān)于這種地理JSON數(shù)據(jù)格式說(shuō)明,可以參考:https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats
示例:
CREATE EXTERNAL TABLE IF NOT EXISTS california_counties ( Name string, BoundaryShape binary ) ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde' STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'oss://test_bucket/datasets/geospatial/california-counties/'
通過(guò)以上例子可以看出,DLA可以支持大部分開(kāi)源存儲(chǔ)格式的文件。對(duì)于同一份數(shù)據(jù),使用不同的存儲(chǔ)格式,在OSS中存儲(chǔ)文件的大小,DLA的查詢分析速度上會(huì)有較大的差別。推薦使用ORC格式進(jìn)行文件的存儲(chǔ)和查詢。
為了獲得更快的查詢速度,DLA還在不斷的優(yōu)化中,后續(xù)也會(huì)支持更多的數(shù)據(jù)源,為用戶帶來(lái)更好的大數(shù)據(jù)分析體驗(yàn)。
上述就是小編為大家分享的Data Lake Analytics + OSS數(shù)據(jù)文件的格式處理分別是怎樣的了,如果剛好有類似的疑惑,不妨參照上述分析進(jìn)行理解。如果想知道更多相關(guān)知識(shí),歡迎關(guān)注創(chuàng)新互聯(lián)行業(yè)資訊頻道。