Hive: Joining Multiple Tables in Single query

11:08:00 AM 4:24:22 PM

Yes !!! Hive supports multiple table joins in single query....

We consider this example is very useful in scenario based interview question(Hive). Where you will be given multiple tables and hive queries are asked for analytics.

Lets start learning by example:

Problem statement : Find total amount purchased along with number of transaction for each customer.

The example will be dealing with three tables Customer, Purchases and Transaction.Here I will demonstrate joining three tables and group by on particular column using Hive.

---------------------------------------------------------------------------------------------------

Customer table will have unique customer ID along with other details of Customer.

***********Customer table***********

cust_id|cust_fname|cust_lname|location

1|sujit|fulse|latur

2|roshan|bagdiya|akola

3|ajit|jadhav|beed

---------------------------------------------------------------------------------------------------

Purchases will have unique purchase Id for each purchase. There will be multiple purchases for each customer.

***********Purchases table***********

purchase_id|cid|store

23|1|wallmart

25|2|oasis

27|1|trad

29|2|wallmart

30|3|oasis

37|3|trad

---------------------------------------------------------------------------------------------------

Transaction table will have unique transaction Id for each transaction along with purchase amount for each transaction

***********Transaction table***********

purchase_id|transa_date|purchase_amt

23|2016-01-19|300

25|2016-01-19|200

27|2016-01-19|100

29|2016-01-19|500

30|2016-01-19|400

37|2016-01-19|100

---------------------------------------------------------------------------------------------------

As u can see, fields are separated by pipe in all tables.

Lets deal with hive queries to load data in hive tables. we are generating three tables. Commands are as below..

---------------------------------------------------------------------------------------------------

Loading tables in hive:

create table custs(cust_id bigint, cust_fname string,cust_lname string, location string )

row format delimited

fields terminated by "|"

stored as textfile;

load data local inpath 'customer' into table custs ;

create table pur(pur_id bigint, cid string,store string)

row format delimited

fields terminated by "|"

stored as textfile;

load data local inpath 'purchase' into table pur ;

create table trans(pur_id bigint, tran_date string,amt string)

row format delimited

fields terminated by "|"

stored as textfile;

load data local inpath 'Transaction' into table trans ;

---------------------------------------------------------------------------------------------------

Running Hive queries to solve problem statement:

(All below three queries, produces same output but implemented by different approach. we are interested in third approach where we have joined multiple table in single query)

1) Simple approach is writing multiple hive queries, generate some intermediate table, using intermediate tables and get final output

create table temp as

select c.cust_id as cust,p.pur_id as pid

from custs c join pur p on (c.cust_id =p.cid)

select t.cust, sum(ts.amt),count(*) from temp t

join trans ts on( t.pid = ts.pur_id)

group by t.cust;

2) Writing nested query for above two separate queries .

select t.cust, sum(ts.amt),count(*) from

(select c.cust_id as cust,p.pur_id as pid from custs c join pur p on (c.cust_id =p.cid) ) t

join trans ts on( t.pid = ts.pur_id)

group by t.cust;

3) Joining Muliple tables in single query.

select c.cust_fname, sum(t.amt) , count(*)

from custs c

inner join pur p on c.cust_id = p.cid

inner join trans t on p.pur_id=t.pur_id

group by c.cust_fname ;

---------------------------------------------------------------------------------------------------

Hive: Joining Multiple Tables in Single query

Post a Comment