Hive: Joining Multiple Tables in Single query

Yes !!! Hive supports multiple table joins in single query....

We consider this example is very useful in scenario based interview question(Hive). Where you will be given multiple tables and hive queries are asked for analytics.

Lets start learning by example:

Problem statement : Find total amount purchased along with number of transaction for each customer.

The example will be dealing with three tables Customer, Purchases and Transaction.Here I will demonstrate joining three tables and group by on particular column using Hive.
---------------------------------------------------------------------------------------------------
Customer table will have unique customer ID along with other details of Customer.
***********Customer table***********
cust_id|cust_fname|cust_lname|location
1|sujit|fulse|latur
2|roshan|bagdiya|akola
3|ajit|jadhav|beed
---------------------------------------------------------------------------------------------------
Purchases will have unique purchase Id for each purchase. There will be multiple purchases for each customer.
***********Purchases table***********
purchase_id|cid|store
23|1|wallmart
25|2|oasis
27|1|trad
29|2|wallmart
30|3|oasis
37|3|trad
---------------------------------------------------------------------------------------------------
Transaction table will have unique transaction Id for each transaction along with purchase amount for each transaction
***********Transaction table***********
purchase_id|transa_date|purchase_amt
23|2016-01-19|300
25|2016-01-19|200
27|2016-01-19|100
29|2016-01-19|500
30|2016-01-19|400
37|2016-01-19|100
---------------------------------------------------------------------------------------------------
As u can see, fields are separated by pipe in all tables.

Lets deal with hive queries to load data in hive tables. we are generating three tables. Commands are as below..
---------------------------------------------------------------------------------------------------
Loading tables in hive:

create table custs(cust_id bigint, cust_fname string,cust_lname string, location string )
row format delimited
fields terminated by "|"
stored as textfile;
load data local inpath 'customer' into table custs ;

create table pur(pur_id bigint, cid string,store string)
row format delimited
fields terminated by "|"
stored as textfile;
load data local inpath 'purchase' into table pur ;


create table trans(pur_id bigint, tran_date string,amt string)
row format delimited
fields terminated by "|"
stored as textfile;
load data local inpath 'Transaction' into table trans ;
---------------------------------------------------------------------------------------------------
Running Hive queries to solve problem statement:

(All below three queries, produces same output but implemented by different approach. we are interested in third approach where we have joined multiple table in single query)

1) Simple approach is writing multiple hive queries, generate some intermediate table, using intermediate tables and get final output

create table temp as
select c.cust_id as cust,p.pur_id as pid
from custs c join pur p on (c.cust_id =p.cid)

select t.cust, sum(ts.amt),count(*) from temp t
join trans ts on( t.pid = ts.pur_id)
group by t.cust;

2) Writing nested query for above two separate queries .

select t.cust, sum(ts.amt),count(*) from
(select c.cust_id as cust,p.pur_id as pid from custs c join pur p on (c.cust_id =p.cid)  ) t
join trans ts on( t.pid = ts.pur_id)
group by t.cust;

3) Joining Muliple tables in single query.

select c.cust_fname, sum(t.amt) , count(*)
from custs c
inner join pur p on c.cust_id = p.cid
inner join trans t on p.pur_id=t.pur_id
group by c.cust_fname ;


---------------------------------------------------------------------------------------------------

hadoop