Tuesday, 27 December 2022

Hive: How to create custom function?

In this post, I am going to explain how to define a custom function in Hive.

 

It is a three step process

a.   Implement custom function by extending Hive UDF class in Java

b.   Export the java project as jar and add to Hive

c.    Create a function as an alias to the class defined in step 1.

 

Follow below step-by-step procedure helps you to build a temporary function.

 

Create new maven project in Eclipse.

 

File -> New -> Maven Project

 


 

Select the check box ‘Create a simple project (skip archetype selection).




Click on Next button.

 


Give

a.   Group Id as ‘com.sample.app’.

b.   Artifact Id as ‘hive-functions’.

c.    Version as 1

 

Click on Finish button. Project structure looks like below.

 


 

Create new package ‘com.sample.app’

Right click on the project -> New -> Package.




Give the package name as ‘com.sample.app’ and click on Finish button.

 


 

Define ‘CamelCaseConverter’ class. Right click on the package ‘com.sample.app’ -> New -> Class.




Give the class name as CamelCaseConverter.

 

 


Click on Finish button.

 

Download and add library to the project ‘hive-functions’

I faced some problems while adding hive-exec maven dependency, that’s why I am downloading the dependency manually.

 

Go to below location and download the library.

https://mvnrepository.com/artifact/org.apache.hive/hive-exec/3.1.3

 

Right click on the project -> Build Path -> Configure Build Path…

 

 


Click on ‘Add External JARs…’ button and add the jar file.

 


Click on Apply and Close button.

 

Once the jar is added successfully, update CamelCaseConverter class.

 

CamelCaseConverter.java

package com.sample.app;

import org.apache.hadoop.hive.ql.exec.UDF;

// Remove spaces in the string and convert to camel case
public class CamelCaseConverter extends UDF {

	public String evaluate(String input) {
		return convertToCamelCase(input);

	}

	public static String convertToCamelCase(String input) {

		if (input == null || input.isEmpty()) {
			return input;
		}

		String[] tokens = input.split("[\\W_]+");

		StringBuilder builder = new StringBuilder();

		for (int i = 0; i < tokens.length; i++) {
			String token = tokens[i];
			if (i == 0) {
				token = token.isEmpty() ? token : token.toLowerCase();
				builder.append(token);
				continue;
			}
			token = token.isEmpty() ? token : Character.toUpperCase(token.charAt(0)) + token.substring(1).toLowerCase();

			builder.append(token);
		}
		return builder.toString();
	}
	
}

Export this project as jar file.

Right click on the project -> Export…




Select Jar file.




Click on Next button.




Click on Browse button and select the location on where to store the jar file. Click on Finish button.

 

Add jar to a hdfs location.

Execute below commands.

hadoop fs -mkdir /hive-jars
hadoop fs -copyFromLocal hive-functions.jar /hive-jars
hadoop fs -ls /hive-jars

[root@quickstart Desktop]# hadoop fs -mkdir /hive-jars
[root@quickstart Desktop]# 
[root@quickstart Desktop]# hadoop fs -copyFromLocal hive-functions.jar /hive-jars
22/04/20 02:17:56 WARN hdfs.DFSClient: Caught exception 
java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Thread.join(Thread.java:1281)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
[root@quickstart Desktop]# 
[root@quickstart Desktop]# hadoop fs -ls /hive-jars
Found 1 items
-rw-r--r--   1 root supergroup       1816 2022-04-20 02:17 /hive-jars/hive-functions.jar

Create a function in hive

CREATE FUNCTION camel_case AS 'com.sample.app.CamelCaseConverter' USING JAR 'hdfs://localhost:8020/hive-jars/hive-functions.jar';

hive> CREATE FUNCTION camel_case AS 'com.sample.app.CamelCaseConverter' USING JAR 'hdfs://localhost:8020/hive-jars/hive-functions.jar';
converting to local hdfs://localhost:8020/hive-jars/hive-functions.jar
Added [/tmp/da34e98c-09c9-4517-8dac-dd8949ac0921_resources/hive-functions.jar] to class path
Added resources: [hdfs://localhost:8020/hive-jars/hive-functions.jar]
OK
Time taken: 0.531 seconds

Let’s create a table ‘t1’ and experiment ‘camel_case’ function with t1 table.

CREATE TABLE t1 (id INT, data STRING);

INSERT INTO t1 VALUES
(1, 'rate of interest'), 
(2, 'time limit'), 
(3, 'requests per second');

hive> CREATE TABLE t1 (id INT, data STRING);
OK
Time taken: 0.128 seconds
hive> ;
hive> ;
hive> INSERT INTO t1 VALUES
    > (1, 'rate of interest'), 
    > (2, 'time limit'), 
    > (3, 'requests per second');
Query ID = cloudera_20220420022323_79bd7acb-c7d9-4571-bbdf-fe6d1f61a240
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0036, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0036/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1649172504056_0036
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-20 02:23:12,979 Stage-1 map = 0%,  reduce = 0%
2022-04-20 02:23:19,550 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.26 sec
MapReduce Total cumulative CPU time: 1 seconds 260 msec
Ended Job = job_1649172504056_0036
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/t1/.hive-staging_hive_2022-04-20_02-23-04_876_2800648824973981435-1/-ext-10000
Loading data to table default.t1
Table default.t1 stats: [numFiles=1, numRows=3, totalSize=54, rawDataSize=51]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.26 sec   HDFS Read: 3826 HDFS Write: 120 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 260 msec
OK
Time taken: 16.012 seconds
hive> ;
hive> ;
hive> SELECT * FROM t1;
OK
1	rate of interest
2	time limit
3	requests per second
Time taken: 0.098 seconds, Fetched: 3 row(s)
hive>

Apply camel_case function on the table ‘t1’.

hive> SELECT id, data, camel_case(data) FROM t1;
OK
1	rate of interest	rateOfInterest
2	time limit	timeLimit
3	requests per second	requestsPerSecond
Time taken: 0.061 seconds, Fetched: 3 row(s)



Previous                                                    Next                                                    Home

No comments:

Post a Comment