Hive Dynamic Partitioning

Originally published 2016-06-02.

Copyright notice: this is the author's original article; reproduction without the author's permission is prohibited. https://blog.csdn.net/qq_29622761/article/details/51570137

Hello everyone! We meet again. Let me introduce myself first: I'm Teacher Xie, and I plan to put 10,000 hours into learning big data. So far I've logged 800-plus hours, and I'm still hard at it.

Today's experiment: dynamic partitioning of Hive partitioned tables.

Partitioning is a common technique for working with large fact tables. The benefit is that it narrows the range a query has to scan, which speeds queries up. There are two kinds: static partitions and dynamic partitions. They differ in how the partition is determined at load time: with static partitioning you type the partition name by hand, while with dynamic partitioning the partition is decided from the data itself. For bulk-loading big data, dynamic partitioning is clearly the simpler and more convenient choice.

Dynamic partitioning spares you from writing a separate insert statement for each partition. So when should you use it? When you don't know in advance how to partition, and the partition has to fall out of the query itself, with partitions built from the data it returns. In other words, when the partition values are not known up front, dynamic partitioning is the right fit.
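To make the contrast concrete, here is a minimal sketch using the testtext (name, addr) source table and the d_part table that appear later in this post:

-- Static: you name the partition value yourself, one insert per partition.
insert overwrite table d_part partition (value='46')
select name from testtext where addr='46';

-- Dynamic: Hive takes the partition value from the last column of the
-- select and creates one partition per distinct value, in a single insert.
insert overwrite table d_part partition (value)
select name, addr as value from testtext;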
There are a few parameters to get acquainted with:

set hive.exec.dynamic.partition=true;               -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict;     -- unrestricted mode; under strict, at least one static partition is required, and it must come first
set hive.exec.max.dynamic.partitions.pernode=10000; -- max dynamic partitions created per node
set hive.exec.max.created.files=150000;             -- max files one job may create
set dfs.datanode.max.xcievers=8192;                 -- cap on files a DataNode can serve at once
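A note on the strict mode mentioned in the comments above: even with hive.exec.dynamic.partition.mode=strict you may still mix static and dynamic partitions, provided at least one static partition is listed first. A minimal sketch, borrowing the two-column d_part2 table created later in this post:

-- Allowed under strict mode: value is static and comes first, dt is dynamic.
insert overwrite table d_part2 partition (value='46', dt)
select 'test' as name, name as dt from testtext where addr='46';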

1 Now, let's get hands-on with dynamic partitioning:

hive> create table d_part(
    > name string
    > )
    > partitioned by(value string)
    > row format delimited fields terminated by '\t'
    > lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.243 seconds
hive>

Notice anything? partitioned by only declares the partition column; no concrete value is specified anywhere.

hive> show partitions d_part;
OK
Time taken: 0.304 seconds
hive> select * from d_part;
OK
Time taken: 0.096 seconds
hive>

No data, right? Let's insert some and let Hive partition it dynamically.

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert overwrite table d_part partition(value)
    > select name,
    > addr as value
    > from testtext;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-02 05:16:16,951 null map = 0%,  reduce = 0%
2016-06-02 05:16:32,117 null map = 100%,  reduce = 0%, Cumulative CPU 1.07 sec
2016-06-02 05:16:33,226 null map = 100%,  reduce = 0%, Cumulative CPU 1.07 sec
2016-06-02 05:16:34,277 null map = 100%,  reduce = 0%, Cumulative CPU 1.07 sec
MapReduce Total cumulative CPU time: 1 seconds 70 msec
Ended Job = job_1464828076391_0016
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table default.d_part partition (value=null)
    Loading partition {value=78}
    Loading partition {value=46}
    Loading partition {value=89}
Partition default.d_part{value=46} stats: [num_files: 1, num_rows: 0, total_size: 4, raw_data_size: 0]
Partition default.d_part{value=78} stats: [num_files: 1, num_rows: 0, total_size: 6, raw_data_size: 0]
Partition default.d_part{value=89} stats: [num_files: 1, num_rows: 0, total_size: 7, raw_data_size: 0]
Table default.d_part stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 17, raw_data_size: 0]
OK
Time taken: 41.974 seconds
hive>

Did the data load? Let's take a look:

hive> select * from d_part;
OK
wer 46
weree   78
wer 89
rr  89
Time taken: 0.732 seconds
hive> show partitions d_part;
OK
value=46
value=78
value=89
Time taken: 0.142 seconds
hive>

Why three partitions? addr as value says it all: the values of addr become the partition values, namely 46, 78, and 89 (the two rows with 89 go into the same partition). See for yourself:

hive> desc testtext;
OK
name    string  name value
addr    string  addr value
Time taken: 0.155 seconds
hive> select * from testtext;
OK
wer 46
wer 89
weree   78
rr  89
Time taken: 0.162 seconds

Now let's check the situation on the web UI. (The screenshot failed to upload in the original post.)
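Since the screenshot is gone, here is roughly what it would have shown: each partition of d_part materializes as its own subdirectory under the table's directory. Assuming the default warehouse location (hive.metastore.warehouse.dir may differ on your cluster), the layout looks like:

/user/hive/warehouse/d_part/value=46/
/user/hive/warehouse/d_part/value=78/
/user/hive/warehouse/d_part/value=89/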

2 Next we'll create a table with two partition columns:

hive> create table d_part2(
    > name string
    > )
    > partitioned by(value string,dt string)
    > row format delimited fields terminated by '\t'
    > lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.092 seconds
hive> set hive.exec.dynamic.partition=true;
hive>

OK, pause for a second. Is that the second time you've seen set hive.exec.dynamic.partition=true;? 'dynamic' means exactly that, so it should be clear by now. (One thing worth noting: both partition columns below are dynamic, so this relies on hive.exec.dynamic.partition.mode=nonstrict still being in effect from earlier in the session; strict mode would reject the insert.)

What next? Insert data:

hive> insert overwrite table d_part2 partition(value,dt)
    > select 'test' as name,
    > addr as value,
    > name as dt
    > from testtext;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-02 05:45:36,828 null map = 0%,  reduce = 0%
2016-06-02 05:45:52,339 null map = 100%,  reduce = 0%, Cumulative CPU 1.3 sec
2016-06-02 05:45:53,443 null map = 100%,  reduce = 0%, Cumulative CPU 1.3 sec
2016-06-02 05:45:54,493 null map = 100%,  reduce = 0%, Cumulative CPU 1.3 sec
MapReduce Total cumulative CPU time: 1 seconds 300 msec
Ended Job = job_1464828076391_0017
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table default.d_part2 partition (value=null, dt=null)
    Loading partition {value=89, dt=wer}
    Loading partition {value=89, dt=rr}
    Loading partition {value=46, dt=wer}
    Loading partition {value=78, dt=weree}
Partition default.d_part2{value=46, dt=wer} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0]
Partition default.d_part2{value=78, dt=weree} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0]
Partition default.d_part2{value=89, dt=rr} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0]
Partition default.d_part2{value=89, dt=wer} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0]
Table default.d_part2 stats: [num_partitions: 4, num_files: 4, num_rows: 0, total_size: 20, raw_data_size: 0]
OK
Time taken: 40.971 seconds
hive>

Note, dear reader: there are two partition columns here, with a dt subpartition under each value partition, and both are derived from the data: addr as value, name as dt. And do you know what sits in d_part2's name column? select 'test' as name, so it's 'test' in every row.
Let's verify:

hive> show partitions d_part2;
OK
value=46/dt=wer
value=78/dt=weree
value=89/dt=rr
value=89/dt=wer
Time taken: 0.293 seconds
hive> select * from d_part2;
OK
test    46  wer
test    78  weree
test    89  rr
test    89  wer
Time taken: 0.435 seconds
hive>
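This, by the way, is where the speedup from the opening paragraph comes from: filter on the partition columns and Hive prunes the scan to the single matching directory instead of reading the whole table. A minimal sketch:

select * from d_part2 where value='89' and dt='wer';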

With these two experiments done, you've basically crossed the threshold. Still, we should practice more, right? Practice makes perfect.

hive> create table d_part3(
    > name string
    > )
    > partitioned by(value string,dt string)
    > row format delimited fields terminated by '\t'
    > lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.121 seconds
hive> set hive.exec.dynamic.partition=true;

Insert the data:

hive> insert overwrite table d_part3 partition(value,dt)
    > select 'test' as name,
    > addr as value,
    > name as dt
    > from testtext;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
Hadoop job information for null: number of mappers: 1; number of reducers: 0
2016-06-02 05:54:27,509 null map = 0%,  reduce = 0%
2016-06-02 05:54:37,411 null map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2016-06-02 05:54:38,501 null map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2016-06-02 05:54:39,555 null map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
MapReduce Total cumulative CPU time: 990 msec
Ended Job = job_1464828076391_0018
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Loading data to table default.d_part3 partition (value=null, dt=null)
    Loading partition {value=78, dt=weree}
    Loading partition {value=46, dt=wer}
    Loading partition {value=89, dt=wer}
    Loading partition {value=89, dt=rr}
Partition default.d_part3{value=46, dt=wer} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0]
Partition default.d_part3{value=78, dt=weree} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0]
Partition default.d_part3{value=89, dt=rr} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0]
Partition default.d_part3{value=89, dt=wer} stats: [num_files: 1, num_rows: 0, total_size: 5, raw_data_size: 0]
Table default.d_part3 stats: [num_partitions: 4, num_files: 4, num_rows: 0, total_size: 20, raw_data_size: 0]
OK
Time taken: 36.737 seconds
hive> show partitions d_part3;
OK
value=46/dt=wer
value=78/dt=weree
value=89/dt=rr
value=89/dt=wer
Time taken: 0.264 seconds
hive>
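If you want a clean slate between practice runs, you can drop the generated partitions one at a time; a minimal sketch (adjust the partition spec to whatever show partitions reports):

hive> alter table d_part3 drop partition (value='89', dt='wer');
hive> show partitions d_part3;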

All right, if one pass doesn't make it stick, do a second.

Here's a homework question for you: under what circumstances can dynamic partitioning turn into a disaster? Leave a comment!

A bit tired now, time for a break. If you've read this far and want to go further or talk with me, follow my WeChat official account: 五十年后.
