In this post, I am going to explain how to define a custom function in Hive.
It is a three step process
a. Implement custom function in Java
b. Export the java project as jar and add to Hive
c. Create a temporary function as an alias to the function defined in step 1.
Follow below step-by-step procedure helps you to build a temporary function.
Create new maven project in Eclipse.
File -> New -> Maven Project
Select the check box ‘Create a simple project (skip archetype selection).
Click on Next button.
Give
a. Group Id as ‘com.sample.app’.
b. Artifact Id as ‘hive-functions’.
c. Version as 1
Click on Finish button. Project structure looks like below.
Create new package ‘com.sample.app’
Right click on the project -> New -> Package.
Give the package name as ‘com.sample.app’ and click on Finish button.
Define ‘CamelCaseConverter’ class. Right click on the package ‘com.sample.app’ -> New -> Class.
Give the class name as CamelCaseConverter.
Click on Finish button.
Download and add library to the project ‘hive-functions’
I faced some problems while adding hive-exec maven dependency, that’s why I am downloading the dependency manually.
Go to below location and download the library.
https://mvnrepository.com/artifact/org.apache.hive/hive-exec/3.1.3
Right click on the project -> Build Path -> Configure Build Path…
Click on ‘Add External JARs…’ button and add the jar file.
Click on Apply and Close button.
Once the jar is added successfully, update CamelCaseConverter class.
CamelCaseConverter.java
package com.sample.app;
import org.apache.hadoop.hive.ql.exec.UDF;
// Remove spaces in the string and convert to camel case
public class CamelCaseConverter extends UDF {
public String evaluate(String input) {
return convertToCamelCase(input);
}
public static String convertToCamelCase(String input) {
if (input == null || input.isEmpty()) {
return input;
}
String[] tokens = input.split("[\\W_]+");
StringBuilder builder = new StringBuilder();
for (int i = 0; i < tokens.length; i++) {
String token = tokens[i];
if (i == 0) {
token = token.isEmpty() ? token : token.toLowerCase();
builder.append(token);
continue;
}
token = token.isEmpty() ? token : Character.toUpperCase(token.charAt(0)) + token.substring(1).toLowerCase();
builder.append(token);
}
return builder.toString();
}
}
Export this project as jar file.
Right click on the project -> Export…
Select Jar file.
Click on Next button.
Click on Browse button and select the location on where to store the jar file. Click on Finish button.
Add jar to hive
Login to hive interactive shell.
# hive Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties WARNING: Hive CLI is deprecated and migration to Beeline is recommended. hive>
Add jar to hive by executing below command.
ADD JAR /home/cloudera/Desktop/SharedData/hive-functions.jar
hive> ADD JAR /home/cloudera/Desktop/hive-functions.jar; Added [/home/cloudera/Desktop/hive-functions.jar] to class path Added resources: [/home/cloudera/Desktop/hive-functions.jar]
Create temporary function by executing below command.
CREATE TEMPORARY FUNCTION camel_case AS 'com.sample.app.CamelCaseConverter';
hive> CREATE TEMPORARY FUNCTION camel_case AS 'com.sample.app.CamelCaseConverter';
OK
Time taken: 0.362 seconds
Create t1 table and insert some data into it.
CREATE TABLE t1 (id INT, data STRING);
INSERT INTO t1 VALUES
(1, 'rate of interest'),
(2, 'time limit'),
(3, 'requests per second');
hive> CREATE TABLE t1 (id INT, data STRING);
OK
Time taken: 0.268 seconds
hive> ;
hive> ;
hive> INSERT INTO t1 VALUES
> (1, 'rate of interest'),
> (2, 'time limit'),
> (3, 'requests per second');
Query ID = root_20220420020404_1a25b17d-0da9-4cfc-91a9-e2331ee3353e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1649172504056_0035, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1649172504056_0035/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1649172504056_0035
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2022-04-20 02:04:51,797 Stage-1 map = 0%, reduce = 0%
2022-04-20 02:05:00,944 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.64 sec
MapReduce Total cumulative CPU time: 1 seconds 640 msec
Ended Job = job_1649172504056_0035
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/t1/.hive-staging_hive_2022-04-20_02-04-37_751_2440121772894137663-1/-ext-10000
Loading data to table default.t1
Table default.t1 stats: [numFiles=1, numRows=3, totalSize=54, rawDataSize=51]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.64 sec HDFS Read: 3818 HDFS Write: 120 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 640 msec
OK
Time taken: 24.611 seconds
hive> ;
hive> ;
hive> ;
hive> SELECT * FROM t1;
OK
1 rate of interest
2 time limit
3 requests per second
Time taken: 0.08 seconds, Fetched: 3 row(s)
hive>
Let’s apply the temp function ‘camel_case’ on t1.
hive> SELECT id, data, camel_case(data) FROM t1;
OK
1 rate of interest rateOfInterest
2 time limit timeLimit
3 requests per second requestsPerSecond
Time taken: 0.118 seconds, Fetched: 3 row(s)
One
problem with the temporary function is, it will get deleted once your hive
session is closed. We can solve this by creating a permanent function. I will
explain this in my next post.
Previous Next Home
No comments:
Post a Comment