Map Side Join
Map Side Join is a Hive feature for speeding up queries. It can be performed when one of the joined tables is small enough to fit into memory.
It performs the join in the mappers, which saves the time otherwise spent shuffling data to the reducers.
In older Hive releases this is not the default behavior; from Hive 0.11.0 onward hive.auto.convert.join defaults to true.
It is controlled by the following property.
hive.auto.convert.join = true;
Once the above property is set to true, a join is converted to a map join whenever the smaller table is below the threshold given by hive.mapjoin.smalltable.filesize, which defaults to 25 MB (25000000 bytes).
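As a rough sketch of how this might look in a session, the settings can be applied before the query and the plan inspected with EXPLAIN. The tables orders and countries and their columns are hypothetical and only serve as an illustration. A converted join normally appears as a Map Join Operator in the plan, though the exact plan text varies by Hive version.

```sql
-- Allow Hive to convert eligible joins to map joins.
SET hive.auto.convert.join=true;
-- Size threshold in bytes for the small table (25000000 is the default).
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical tables: orders is assumed large, countries is assumed small.
EXPLAIN
SELECT o.order_id, c.country_name
FROM orders o
JOIN countries c ON o.country_code = c.country_code;
```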
Bucket Map Join
Bucket Map Join can be performed when both tables are bucketed on the column used in the join. For example:
table1(id1,name1) clustered by (id1) into 2 buckets
table2(id2,name2) clustered by (id2) into 2 buckets
table1 join table2 on id1 = id2
All records with the same id fall into the same bucket, so you won't find records with the same id in different buckets of a table.
Bucket Map Join is performed on the mapper side: each mapper receives only the buckets holding records with matching ids (a bucket of table1 joined with the corresponding bucket of table2), rather than a complete copy of table2.
This is not the default behavior. It is enabled by setting hive.optimize.bucketmapjoin and, where the join is not converted automatically, marking the smaller table with a MAPJOIN hint.
set hive.optimize.bucketmapjoin=true;
select /*+ MAPJOIN(t2) */ t1.* from table1 t1 join table2 t2 on t1.id1 = t2.id2;
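To make the example concrete, here is a minimal sketch of the two bucketed tables as HiveQL DDL, followed by the join. The column types are assumptions, and hive.enforce.bucketing only matters on Hive releases before 2.0, where bucketing was not enforced on insert by default.

```sql
-- Only needed before Hive 2.0; later releases always enforce bucketing.
SET hive.enforce.bucketing=true;

-- Column types are assumed for illustration.
CREATE TABLE table1 (id1 INT, name1 STRING)
CLUSTERED BY (id1) INTO 2 BUCKETS;

CREATE TABLE table2 (id2 INT, name2 STRING)
CLUSTERED BY (id2) INTO 2 BUCKETS;

-- Let Hive join matching buckets in the mappers.
SET hive.optimize.bucketmapjoin=true;

SELECT /*+ MAPJOIN(t2) */ t1.*
FROM table1 t1
JOIN table2 t2 ON t1.id1 = t2.id2;
```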
If both tables are also sorted on the id column used for bucketing and joining, a sort-merge join can be performed by setting the following properties.
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
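For a sort-merge join the tables also have to be declared as sorted on the bucketing column. A minimal sketch, using the hypothetical table names table1_sorted and table2_sorted and assumed column types:

```sql
-- SORTED BY records the sort order within each bucket in the table metadata.
CREATE TABLE table1_sorted (id1 INT, name1 STRING)
CLUSTERED BY (id1) SORTED BY (id1 ASC) INTO 2 BUCKETS;

CREATE TABLE table2_sorted (id2 INT, name2 STRING)
CLUSTERED BY (id2) SORTED BY (id2 ASC) INTO 2 BUCKETS;

SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT /*+ MAPJOIN(t2) */ t1.*
FROM table1_sorted t1
JOIN table2_sorted t2 ON t1.id1 = t2.id2;
```

Whether the planner actually chooses a sort-merge join depends on the Hive version and on the data having been loaded with bucketing and sorting respected, so it is worth checking the plan with EXPLAIN.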