Hive中各种Join的实现
一. 数据准备
1. 创建两张表
create table tablea (id int, name string) row format delimited fields terminated by ',';
create table tableb (id int, age int) row format delimited fields terminated by ',';
2. 准备两份数据
tablea.txt文件数据如下:
1,huangbo
2,xuzheng
4,wangbaoqiang
6,huangxiaoming
7,fengjie
10,liudehua
tableb.txt 文件数据如下:
2,20
4,50
7,80
10,22
12,33
15,44
3. 将数据导入对应的表中
load data local inpath '/root/tablea.txt' into table tablea;
load data local inpath '/root/tableb.txt' into table tableb;
二. JOIN实现
1. inner join(内连接)
inner join 就是取两表的交集
select * from tablea a inner join tableb b on a.id = b.id;
结果:
2 xuzheng 2 20
4 wangbaoqiang 4 50
7 fengjie 7 80
10 liudehua 10 22
2. left outer join (左外连接)
left outer join是以左表基准,右表不存在的key均赋值为null
select * from tablea a left join tableb b on a.id = b.id;
结果:
1 huangbo NULL NULL
2 xuzheng 2 20
4 wangbaoqiang 4 50
6 huangxiaoming NULL NULL
7 fengjie 7 80
10 liudehua 10 22
3. right outer join(右外连接)
right outer join以右表基准,左表不存在的key均赋值为null
select * from tablea a right join tableb b on a.id = b.id;
结果:
2 xuzheng 2 20
4 wangbaoqiang 4 50
7 fengjie 7 80
10 liudehua 10 22
NULL NULL 12 33
NULL NULL 15 44
4. full outer join(全外连接)
full outer join 是对左右两表求并集,两个表中不存在的key均赋值null
select * from tablea a full outer join tableb b on a.id = b.id;
结果:
1 huangbo NULL NULL
2 xuzheng 2 20
4 wangbaoqiang 4 50
6 huangxiaoming NULL NULL
7 fengjie 7 80
10 liudehua 10 22
NULL NULL 12 33
NULL NULL 15 44
5. left semi join(左半连接)
left semi join返回两个表交集中左表的部分,Hive低版本中没有实现标准SQL中exist, in,因此它最主要的使用场景就是替代exist与in。
需求:
找出tablea表在tableb表中同时存在的id的记录,这个需求用IN和EXISTS的实现语句如下:
-- IN
select * from tablea where id in (select id from tableb);
-- EXISTS
select * from tablea where exists (select * from tableb where tablea.id = tableb.id);
运行结果:
2 xuzheng
4 wangbaoqiang
7 fengjie
10 liudehua
这个需求在Hive低版本中可以使用left semi join实现,同时可以获得更好的性能。
select * from tablea a left semi join tableb b on a.id = b.id;
运行结果:
2 xuzheng
4 wangbaoqiang
7 fengjie
10 liudehua