Tuesday, 27 December 2022

Hive: How to create custom temporary function?

In this post, I am going to explain how to define a custom function in Hive.

 

It is a three step process

a.   Implement custom function in Java

b.   Export the java project as jar and add to Hive

c.    Create a temporary function as an alias to the function defined in step 1.

 

Follow below step-by-step procedure helps you to build a temporary function.

 

Create new maven project in Eclipse.

 

File -> New -> Maven Project

 


Select the check box ‘Create a simple project (skip archetype selection).




Click on Next button.

 


Give

a.   Group Id as ‘com.sample.app’.

b.   Artifact Id as ‘hive-functions’.

c.    Version as 1

 

Click on Finish button. Project structure looks like below.




Create new package ‘com.sample.app’

Right click on the project -> New -> Package.

 


Give the package name as ‘com.sample.app’ and click on Finish button.




Define ‘CamelCaseConverter’ class. Right click on the package ‘com.sample.app’ -> New -> Class.

 

 


Give the class name as CamelCaseConverter.

 


 

Click on Finish button.

 

Download and add library to the project ‘hive-functions’

I faced some problems while adding hive-exec maven dependency, that’s why I am downloading the dependency manually.

 

Go to below location and download the library.

https://mvnrepository.com/artifact/org.apache.hive/hive-exec/3.1.3

 

Right click on the project -> Build Path -> Configure Build Path…

 

 


Click on ‘Add External JARs…’ button and add the jar file.




Click on Apply and Close button.

 

Once the jar is added successfully, update CamelCaseConverter class.

 

CamelCaseConverter.java

package com.sample.app;

import org.apache.hadoop.hive.ql.exec.UDF;

// Remove spaces in the string and convert to camel case
public class CamelCaseConverter extends UDF {

	public String evaluate(String input) {
		return convertToCamelCase(input);

	}

	public static String convertToCamelCase(String input) {

		if (input == null || input.isEmpty()) {
			return input;
		}

		String[] tokens = input.split("[\\W_]+");

		StringBuilder builder = new StringBuilder();

		for (int i = 0; i < tokens.length; i++) {
			String token = tokens[i];
			if (i == 0) {
				token = token.isEmpty() ? token : token.toLowerCase();
				builder.append(token);
				continue;
			}
			token = token.isEmpty() ? token : Character.toUpperCase(token.charAt(0)) + token.substring(1).toLowerCase();

			builder.append(token);
		}
		return builder.toString();
	}
	
}

Export this project as jar file.

Right click on the project -> Export…




Select Jar file.

 


 

Click on Next button.




Click on Browse button and select the location on where to store the jar file. Click on Finish button.

 

Add jar to hive

Login to hive interactive shell.

# hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>

 

Add jar to hive by executing below command.

ADD JAR /home/cloudera/Desktop/SharedData/hive-functions.jar

 

hive> ADD JAR /home/cloudera/Desktop/hive-functions.jar;
Added [/home/cloudera/Desktop/hive-functions.jar] to class path
Added resources: [/home/cloudera/Desktop/hive-functions.jar]

 

Create temporary function by executing below command.

CREATE TEMPORARY FUNCTION camel_case AS 'com.sample.app.CamelCaseConverter';
hive> CREATE TEMPORARY FUNCTION camel_case AS 'com.sample.app.CamelCaseConverter';
OK
Time taken: 0.362 seconds

 

Create t1 table and insert some data into it.

CREATE TABLE t1 (id INT, data STRING);

INSERT INTO t1 VALUES
(1, 'rate of interest'), 
(2, 'time limit'), 
(3, 'requests per second');

hive> CREATE TABLE t1 (id INT, data STRING);
OK
Time taken: 0.268 seconds
hive> ;
hive> ;
hive> INSERT INTO t1 VALUES
    > (1, 'rate of interest'), 
    > (2, 'time limit'), 
    > (3, 'requests per second');
Query ID = root_20220420020404_1a25b17d-0da9-4cfc-91a9-e2331ee3353e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0035, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0035/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0035
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-20 02:04:51,797 Stage-1 map = 0%,  reduce = 0%
2022-04-20 02:05:00,944 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.64 sec
MapReduce Total cumulative CPU time: 1 seconds 640 msec
Ended Job = job_1649172504056_0035
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/t1/.hive-staging_hive_2022-04-20_02-04-37_751_2440121772894137663-1/-ext-10000
Loading data to table default.t1
Table default.t1 stats: [numFiles=1, numRows=3, totalSize=54, rawDataSize=51]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.64 sec   HDFS Read: 3818 HDFS Write: 120 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 640 msec
OK
Time taken: 24.611 seconds
hive> ;
hive> ;
hive> ;
hive> SELECT * FROM t1;
OK
1	rate of interest
2	time limit
3	requests per second
Time taken: 0.08 seconds, Fetched: 3 row(s)
hive>

 

Let’s apply the temp function ‘camel_case’ on t1.

hive> SELECT id, data, camel_case(data) FROM t1;
OK
1	rate of interest	rateOfInterest
2	time limit	timeLimit
3	requests per second	requestsPerSecond
Time taken: 0.118 seconds, Fetched: 3 row(s)

One problem with the temporary function is, it will get deleted once your hive session is closed. We can solve this by creating a permanent function. I will explain this in my next post.

 

 

 

 

 

 

Previous                                                    Next                                                    Home

No comments:

Post a Comment