Programming for beginners: Hive: subqueries

Subquery is a query nested in another query. You can embed a subquery in FROM, WHERE clauses in Hive.

Example 1: subquery in WHERE clause

SELECT * FROM employee 
WHERE employee.id IN (SELECT emp_id FROM employee_address where city='Bangalore');

Example 2: Union of records from tables employee and employee1.

SELECT * FROM
(
   SELECT * FROM employee
   UNION ALL
   SELECT * FROM employee1
) t;

Let’s experiment it with below example.

employee table

Id	name

employee_address table

id	emp_id	street	city	country

Let’s create employee and employee_address tables by executing below statements.

hive> CREATE TABLE employee (id INT, name STRING);
OK
Time taken: 0.169 seconds
hive> ;
hive> ;
hive> CREATE TABLE employee_address(id INT, emp_id INT, street STRING, city STRING, country STRING);
OK
Time taken: 0.063 seconds

Insert data into employee table.

INSERT INTO employee VALUES
(1, 'Ravi'), 
(2, 'Ram'), 
(3, 'Siva');

hive> INSERT INTO employee VALUES
    > (1, 'Ravi'), 
    > (2, 'Ram'), 
    > (3, 'Siva');
Query ID = cloudera_20220417034141_8380194c-1f74-41cb-858f-1f043d733b31
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0022, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0022/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0022
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-17 03:41:56,698 Stage-1 map = 0%,  reduce = 0%
2022-04-17 03:42:05,291 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
MapReduce Total cumulative CPU time: 1 seconds 390 msec
Ended Job = job_1649172504056_0022
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee/.hive-staging_hive_2022-04-17_03-41-47_758_1924624601576248628-1/-ext-10000
Loading data to table default.employee
Table default.employee stats: [numFiles=1, numRows=3, totalSize=20, rawDataSize=17]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.39 sec   HDFS Read: 3834 HDFS Write: 92 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 390 msec
OK
Time taken: 19.176 seconds
hive>

Insert data into employee_address table.

INSERT INTO employee_address VALUES
(11, 1, 'Chowdeswari street', 'Bangalore', 'India'), 
(12, 2, 'Devendra Nagar', 'Hyderabad', 'India'), 
(13, 3, 'Rajaji Nagar', 'Bangaore', 'India');

hive> INSERT INTO employee_address VALUES
    > (11, 1, 'Chowdeswari street', 'Bangalore', 'India'), 
    > (12, 2, 'Devendra Nagar', 'Hyderabad', 'India'), 
    > (13, 3, 'Rajaji Nagar', 'Bangaore', 'India');
Query ID = cloudera_20220417034242_9be76655-c460-4891-a1d0-43ab47d78efd
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0023, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0023/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0023
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-17 03:42:51,156 Stage-1 map = 0%,  reduce = 0%
2022-04-17 03:42:57,567 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.31 sec
MapReduce Total cumulative CPU time: 1 seconds 310 msec
Ended Job = job_1649172504056_0023
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee_address/.hive-staging_hive_2022-04-17_03-42-43_996_4248675605426025187-1/-ext-10000
Loading data to table default.employee_address
Table default.employee_address stats: [numFiles=1, numRows=3, totalSize=109, rawDataSize=106]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.31 sec   HDFS Read: 4538 HDFS Write: 190 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 310 msec
OK
Time taken: 14.76 seconds

Let’s print and confirm the data in employee, employee_address tables.

hive> SET hive.cli.print.header=true;
hive> ;
hive> ;
hive> SELECT * FROM employee;
OK
employee.id employee.name
1   Ravi
2   Ram
3   Siva
Time taken: 0.086 seconds, Fetched: 3 row(s)
hive> ;
hive> ;
hive> SELECT * FROM employee_address;
OK
employee_address.id employee_address.emp_id employee_address.street employee_address.city   employee_address.country
11  1   Chowdeswari street  Bangalore   India
12  2   Devendra Nagar  Hyderabad   India
13  3   Rajaji Nagar    Bangaore    India
Time taken: 0.036 seconds, Fetched: 3 row(s)

Subquery in WHERE clause

Get all the employees whose city is ‘Bangalore’.

SELECT * FROM employee 
WHERE employee.id IN (SELECT emp_id FROM employee_address where city='Bangalore');

hive> SELECT * FROM employee 
    > WHERE employee.id IN (SELECT emp_id FROM employee_address where city='Bangalore');
Query ID = cloudera_20220417034646_5b5f015c-e344-487a-be23-16bd10683a1b
Total jobs = 1
Execution log at: /tmp/cloudera/cloudera_20220417034646_5b5f015c-e344-487a-be23-16bd10683a1b.log
2022-04-17 03:46:55	Starting to launch local task to process map join;	maximum memory = 1013645312
2022-04-17 03:46:56	Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/cloudera/16ec1170-56f8-45e7-9873-522e06448b92/hive_2022-04-17_03-46-50_841_181343302282253929-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable
2022-04-17 03:46:56	Uploaded 1 File to: file:/tmp/cloudera/16ec1170-56f8-45e7-9873-522e06448b92/hive_2022-04-17_03-46-50_841_181343302282253929-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (278 bytes)
2022-04-17 03:46:56	End of local task; Time Taken: 1.244 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0024, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0024/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0024
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2022-04-17 03:47:05,536 Stage-3 map = 0%,  reduce = 0%
2022-04-17 03:47:13,095 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.33 sec
MapReduce Total cumulative CPU time: 1 seconds 330 msec
Ended Job = job_1649172504056_0024
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 1.33 sec   HDFS Read: 6080 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 330 msec
OK
employee.id	employee.name
1	Ravi
Time taken: 23.323 seconds, Fetched: 1 row(s)

Subquery in FROM clause

Let’s create another table like employee1 and union these students with the help of subquery in Hive.

CREATE TABLE employee1 LIKE employee;

INSERT INTO employee1 VALUES
(1, 'Raheem'), 
(12, 'Joel'), 
(13, 'Sailu');

hive> create table employee1 like employee;
OK
Time taken: 0.071 seconds
hive> ;
hive> ;
hive> INSERT INTO employee1 VALUES
    > (1, 'Raheem'), 
    > (12, 'Joel'), 
    > (13, 'Sailu');
Query ID = cloudera_20220417035252_6e18b30f-3c57-43ea-846f-7aad713c5cec
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0025, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0025/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0025
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-17 03:52:37,790 Stage-1 map = 0%,  reduce = 0%
2022-04-17 03:52:45,259 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.48 sec
MapReduce Total cumulative CPU time: 1 seconds 480 msec
Ended Job = job_1649172504056_0025
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee1/.hive-staging_hive_2022-04-17_03-52-30_578_3077524688190003825-1/-ext-10000
Loading data to table default.employee1
Table default.employee1 stats: [numFiles=1, numRows=3, totalSize=26, rawDataSize=23]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.48 sec   HDFS Read: 3857 HDFS Write: 99 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 480 msec
OK
_col0   _col1
Time taken: 15.885 seconds
hive>

Let’s print the records from employee1 table.

hive> SELECT * FROM employee1;
OK
employee1.id    employee1.name
1   Raheem
12  Joel
13  Sailu
Time taken: 0.074 seconds, Fetched: 3 row(s)

Let’s union the data from employee and employee1 tables.

SELECT * FROM
(
   SELECT * FROM employee
   UNION ALL
   SELECT * FROM employee1
) t;

hive> SELECT * FROM
    > (
    >    SELECT * FROM employee
    >    UNION ALL
    >    SELECT * FROM employee1
    > ) t;
Query ID = cloudera_20220417035555_1595511b-b3a9-4950-8843-6d5f2723f009
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0026, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0026/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0026
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2022-04-17 03:55:24,506 Stage-1 map = 0%,  reduce = 0%
2022-04-17 03:55:35,453 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.07 sec
2022-04-17 03:55:36,493 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.99 sec
MapReduce Total cumulative CPU time: 1 seconds 990 msec
Ended Job = job_1649172504056_0026
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2   Cumulative CPU: 1.99 sec   HDFS Read: 9685 HDFS Write: 46 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 990 msec
OK
t.id	t.name
1	Raheem
12	Joel
13	Sailu
1	Ravi
2	Ram
3	Siva
Time taken: 20.103 seconds, Fetched: 6 row(s)
hive>

Previous Next Home

Programming for beginners

Sunday, 25 December 2022

Hive: subqueries

No comments:

Post a Comment