
The ORC File Storage Format and Creating ORC Tables in Hive

Date: 2023-02-14 10:10:03



First published at: /archives/134.html

Advantages of ORC

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

Compared with RCFile format, for example, ORC file format has many advantages such as:

- a single file as the output of each task, which reduces the NameNode's load
- Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
- light-weight indexes stored within the file
- skip row groups that don't pass predicate filtering
- seek to a given row
- block-mode compression based on data type
- run-length encoding for integer columns
- dictionary encoding for string columns
- concurrent reads of the same file using separate RecordReaders
- ability to split files without scanning for markers
- bound the amount of memory needed for reading or writing
- metadata stored using Protocol Buffers, which allows addition and removal of fields
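Two of the column encodings listed above are easy to picture. The toy sketch below (illustrative only, not ORC's actual on-disk encoding) shows run-length encoding for integer columns and dictionary encoding for string columns:

```python
def run_length_encode(values):
    """Collapse runs of repeated integers into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def dictionary_encode(values):
    """Replace repeated strings with small integer ids plus a dictionary."""
    dictionary, ids, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

print(run_length_encode([50, 50, 50, -25, -25]))   # [(50, 3), (-25, 2)]
print(dictionary_encode(["Tom", "Jelly", "Tom"]))  # (['Tom', 'Jelly'], [0, 1, 0])
```

Both encodings shrink columns whose values repeat, which is exactly the shape columnar storage produces.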

ORC File Structure

An ORC file is made up of stripes, a file footer, and a postscript.

file footer

contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.

postscript

holds compression parameters and the size of the compressed footer.

stripe

each stripe in an ORC file holds index data, row data, and a stripe footer.

- index data includes min and max values for each column and the row positions within each column.
- row data is used in table scans.
- stripe footer contains a directory of stream locations.
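The per-column min/max values in the index data are what let a reader skip whole stripes that cannot satisfy a predicate. A minimal sketch of the idea (the statistics layout and function below are hypothetical, not ORC's real reader API):

```python
# Hypothetical per-stripe statistics for one integer column ("cash"),
# mimicking the min/max values ORC keeps in its index data.
stripes = [
    {"rows": 1000, "min": -25, "max": 50},
    {"rows": 1000, "min": 100, "max": 200},
    {"rows": 1000, "min": 300, "max": 900},
]

def stripes_to_read(stripes, lo, hi):
    """Return indexes of stripes whose [min, max] range overlaps [lo, hi];
    the rest can be skipped without decompressing any row data."""
    return [i for i, s in enumerate(stripes)
            if s["max"] >= lo and s["min"] <= hi]

# Predicate: WHERE cash BETWEEN 150 AND 400 -> only stripes 1 and 2 overlap.
print(stripes_to_read(stripes, 150, 400))  # [1, 2]
```

Stripe 0's maximum (50) is below the lower bound, so its row data is never touched.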

Hive Table properties

ORC-specific parameters, such as the compression codec orc.compress, are placed in the table's TBLPROPERTIES clause when the table is created.

Creating an ORC Table

CREATE TABLE IF NOT EXISTS bank.account_orc (
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int)
PARTITIONED BY (ds string)
STORED AS ORC;

If orc.compress is not specified, ZLIB compression is used by default. orc.compress can also be set to NONE or SNAPPY.
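To get a feel for why a compression codec is on by default, the following quick illustration uses Python's standard zlib module (ORC's ZLIB codec is based on the same DEFLATE algorithm; the sample byte string is made up):

```python
import zlib

# A column with many repeated values compresses very well under DEFLATE.
raw = b"Tom,50;" * 1000
compressed = zlib.compress(raw)

print(len(raw), len(compressed))  # the compressed form is far smaller

# Round-trip check: decompression restores the original bytes exactly.
assert zlib.decompress(compressed) == raw
```

Real column data is less regular than this, but repeated values within a column are common, which is why columnar formats compress so well.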

Showing the ORC Table's CREATE Statement

SHOW CREATE TABLE bank.account_orc;

CREATE TABLE `bank.account_orc`(
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int)
PARTITIONED BY (
  `ds` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc'
TBLPROPERTIES (
  'transient_lastDdlTime'='1627435885')

Inserting Sample Data into the ORC Table

INSERT INTO bank.account_orc PARTITION (ds='2021-09-21') VALUES (1000, '2021-09-21 14:30:00', 'Tom', 100);
INSERT INTO bank.account_orc PARTITION (ds='2021-09-20') VALUES (1000, '2021-09-20 14:30:05', 'Tom', 50);
INSERT INTO bank.account_orc PARTITION (ds='2021-09-20') VALUES (1000, '2021-09-20 14:30:10', 'Tom', -25);
INSERT INTO bank.account_orc PARTITION (ds='2021-09-21') VALUES (1001, '2021-09-21 15:30:00', 'Jelly', 200);
INSERT INTO bank.account_orc PARTITION (ds='2021-09-21') VALUES (1001, '2021-09-21 15:30:05', 'Jelly', -50);

Inspecting the Underlying Files

[root@ ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2021-09-20
Found 2 items
-rwxrwx--x+ 3 hive hive 519 2021-07-28 09:33 /user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0
-rwxrwx--x+ 3 hive hive 519 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0_copy_1
[root@ ~]#
[root@ ~]# hadoop fs -ls /user/hive/warehouse/bank.db/account_orc/ds=2021-09-21
Found 3 items
-rwxrwx--x+ 3 hive hive 516 2021-07-28 09:32 /user/hive/warehouse/bank.db/account_orc/ds=2021-09-21/000000_0
-rwxrwx--x+ 3 hive hive 528 2021-07-28 09:34 /user/hive/warehouse/bank.db/account_orc/ds=2021-09-21/000000_0_copy_1
-rwxrwx--x+ 3 hive hive 528 2021-07-28 09:35 /user/hive/warehouse/bank.db/account_orc/ds=2021-09-21/000000_0_copy_2

Dumping ORC File Metadata

hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0

[root@ ~]# hive --orcfiledump hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0
Structure for hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0
File Version: 0.12 with HIVE_8732
21/07/28 09:52:12 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int,_col1:string,_col2:string,_col3:int>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
    Column 2: count: 1 hasNull: false min: 2021-09-20 14:30:05 max: 2021-09-20 14:30:05 sum: 19
    Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
    Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false min: 1000 max: 1000 sum: 1000
  Column 2: count: 1 hasNull: false min: 2021-09-20 14:30:05 max: 2021-09-20 14:30:05 sum: 19
  Column 3: count: 1 hasNull: false min: Tom max: Tom sum: 3
  Column 4: count: 1 hasNull: false min: 50 max: 50 sum: 50

Stripes:
  Stripe: offset: 3 data: 53 rows: 1 tail: 65 index: 136
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 27
    Stream: column 2 section ROW_INDEX start: 41 length 45
    Stream: column 3 section ROW_INDEX start: 86 length 29
    Stream: column 4 section ROW_INDEX start: 115 length 24
    Stream: column 1 section DATA start: 139 length 7
    Stream: column 2 section DATA start: 146 length 22
    Stream: column 2 section LENGTH start: 168 length 6
    Stream: column 3 section DATA start: 174 length 6
    Stream: column 3 section LENGTH start: 180 length 6
    Stream: column 4 section DATA start: 186 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2

File length: 519 bytes
Padding length: 0 bytes
Padding ratio: 0%

Dumping ORC File Data

hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0

[root@ ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0
21/07/28 09:53:08 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2021-09-20 14:30:05","_col2":"Tom","_col3":50}
[root@ ~]#
[root@ ~]# hive --orcfiledump -d hdfs:///user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0_copy_1
21/07/28 09:53:50 INFO orc.ReaderImpl: Reading ORC rows from hdfs:/user/hive/warehouse/bank.db/account_orc/ds=2021-09-20/000000_0_copy_1 with {include: null, offset: 0, length: 9223372036854775807}
{"_col0":1000,"_col1":"2021-09-20 14:30:10","_col2":"Tom","_col3":-25}
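Because `--orcfiledump -d` prints one JSON object per row, its output is easy to post-process with any JSON tool. A minimal sketch in Python (the captured line is from the dump above; mapping the positional `_colN` names back to real column names is an assumption based on this table's DDL):

```python
import json

# One row as orcfiledump -d prints it; the dump uses _colN names because
# it does not know the Hive table schema.
line = '{"_col0":1000,"_col1":"2021-09-20 14:30:05","_col2":"Tom","_col3":50}'

row = json.loads(line)
# Assumed mapping from column position to the table's column names.
names = ["id_card", "tran_time", "name", "cash"]
record = {names[i]: row[f"_col{i}"] for i in range(len(names))}
print(record)
# {'id_card': 1000, 'tran_time': '2021-09-20 14:30:05', 'name': 'Tom', 'cash': 50}
```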

Creating an ORC Table with Snappy Compression

CREATE TABLE IF NOT EXISTS bank.account_orc_snappy (
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int)
PARTITIONED BY (ds string)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

Showing the Snappy-Compressed ORC Table's CREATE Statement

SHOW CREATE TABLE bank.account_orc_snappy;

CREATE TABLE `bank.account_orc_snappy`(
  `id_card` int,
  `tran_time` string,
  `name` string,
  `cash` int)
PARTITIONED BY (
  `ds` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/bank.db/account_orc_snappy'
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'transient_lastDdlTime'='1627542655')

References

- /confluence/display/Hive/LanguageManual+ORC
- /specification/ORCv1/
- /dabokele/article/details/51542327
- /dabokele/article/details/51813322

You are welcome to follow my WeChat official account, "九万里大数据" (Jiuwanli Big Data), where original technical articles are published first.

You are also welcome to visit my original technical blog site, where the layout is cleaner and reading is more comfortable.
