1.0. What's in this blog?
In my previous blog on creating custom UDFs in Hive, I covered a sample basic UDF. This blog covers generic UDF creation, to mimic the same NVL2 functionality covered in the previous blog. It includes sample data, java code for creating the UDF, expected results, commands to execute and the output.
[hive 0.10]
[hive 0.10]
About UDFs:
UDF stands for User Defined Function. In Hive, there are (a) reusable functions available, as part of core Hive (out of the box) that can be used in Hive queries; They are called UDFs, even though they are not user-defined. And then there are (b) functions that one can create in Java, also called UDFs, and use in Hive queries. The focus of this blog is custom UDFs (b), specifically generic UDFs.
About generic UDF:
About NVL2:
UDFs in Hive have are extensions of either UDF or GenericUDF classes. GenericUDFs are more optimal from a performance perspective as they use short circuit evaluation and lazy evaluation, when compared to UDFs that use reflection. GenericUDFs support non-primitive Hive types like arrays, structs and maps in addition to primitive types, unlike UDFs that support only primitive types.
About NVL2:
NVL2 takes three parameters, we will refer to as expr1, expr2 and expr3.
NVL2 lets you determine the value returned by a query based on whether a specified expression is null or not null. If expr1 is not null, then NVL2 returns expr2. If expr1 is null, then NVL2 returns expr3.
2.0. NVL2 generic UDF in Hive
1: Create the test data file for a Hive external table
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
************************** | |
Input data | |
************************** | |
1. Execute locally on the node you are running Hive client from | |
----------------------------------------------------------------- | |
Create input file/data to use for the demo. | |
Since this gist is merely for instructional purpose, the dataset is small. | |
cd ~ | |
mkdir hiveProject | |
cd hiveProject | |
vi Departments_UDFTest | |
Paste this..ensuring the fields are delimited by tabs and record with new line. | |
d001 Marketing | |
d002 Finance | |
d003 Human Resources | |
d004 Production | |
d005 Development | |
d006 Quality Management | |
d007 Sales | |
d008 | |
d009 Customer Service | |
2. Hadoop commands | |
-------------------- | |
hadoop fs -mkdir hiveProject | |
hadoop fs -put hiveProject/Departments_UDFTest hiveProject |
2: Create the Hive table
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
********************************************* | |
Setting up the Hive table | |
********************************************* | |
In hive shell.... | |
a) Create table: | |
CREATE EXTERNAL TABLE IF NOT EXISTS departments_UDFTest | |
( | |
deptNo String, | |
deptName String | |
) | |
Row format delimited | |
fields terminated by '\t' | |
LOCATION '/user/akhanolk/hiveProject'; | |
b) Quick test: | |
Select * from departments_UDFTest; | |
d001 Marketing | |
d002 Finance | |
d003 Human Resources | |
d004 Production | |
d005 Development | |
d006 Quality Management | |
d007 Sales | |
d008 NULL | |
d009 Customer Service |
3: Create the UDF in Java
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
package khanolkar.HiveUDFs; | |
import org.apache.hadoop.hive.ql.exec.Description; | |
import org.apache.hadoop.hive.ql.exec.UDFArgumentException; | |
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException; | |
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException; | |
import org.apache.hadoop.hive.ql.metadata.HiveException; | |
import org.apache.hadoop.hive.ql.udf.UDFType; | |
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF; | |
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils; | |
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; | |
@UDFType(deterministic = true) | |
@Description(name = "NVL2", value = "NVL2(expr1,expr2,expr3) returns expr3, if expr1 is null, otherwise returns expr2;", extended = "NVL2(expr1,expr2,expr3) returns expr3, if expr1 is null, otherwise retruns expr2") | |
public class NVL2GenericUDF extends GenericUDF { | |
private GenericUDFUtils.ReturnObjectInspectorResolver returnOIResolver; | |
private ObjectInspector[] argumentOIs; | |
@Override | |
public ObjectInspector initialize(ObjectInspector[] arguments) | |
throws UDFArgumentException { | |
argumentOIs = arguments; | |
// First check - do we have the right number of arguments? | |
if (arguments.length != 3) { | |
throw new UDFArgumentLengthException( | |
"The operator 'NVL2' accepts 3 arguments."); | |
} | |
// Second check - throw exception if any complex types have been passed | |
// as parameters | |
if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE | |
|| arguments[1].getCategory() != ObjectInspector.Category.PRIMITIVE | |
|| arguments[2].getCategory() != ObjectInspector.Category.PRIMITIVE) | |
throw new UDFArgumentTypeException(0, | |
"Only primitive type arguments are accepted"); | |
// Third check - throw exception if the data types across parameters are | |
// different | |
if (!(returnOIResolver.update(arguments[0]) && returnOIResolver | |
.update(arguments[1])) | |
|| !(returnOIResolver.update(arguments[1]) && returnOIResolver | |
.update(arguments[2]))) { | |
throw new UDFArgumentTypeException(2, | |
"The arguments of function NLV2 should have the same type, " | |
+ "but they are different: \"" | |
+ arguments[0].getTypeName() + "\" and \"" | |
+ arguments[1].getTypeName() + "\" and \"" | |
+ arguments[2].getTypeName() + "\""); | |
} | |
returnOIResolver = new GenericUDFUtils.ReturnObjectInspectorResolver( | |
true); | |
return returnOIResolver.get(); | |
} | |
@Override | |
public Object evaluate(DeferredObject[] arguments) throws HiveException { | |
// The NVL2 functionality | |
Object retVal = returnOIResolver.convertIfNecessary(arguments[0].get(), | |
argumentOIs[0]); | |
if (retVal == null) { | |
retVal = returnOIResolver.convertIfNecessary(arguments[2].get(), | |
argumentOIs[2]); | |
} else { | |
retVal = returnOIResolver.convertIfNecessary(arguments[1].get(), | |
argumentOIs[1]); | |
} | |
return retVal; | |
} | |
@Override | |
public String getDisplayString(String[] children) { | |
StringBuilder sb = new StringBuilder(); | |
sb.append("if "); | |
sb.append(children[0]); | |
sb.append(" is null "); | |
sb.append("returns"); | |
sb.append(children[2]); | |
sb.append("else "); | |
sb.append("returns"); | |
sb.append(children[1]); | |
return sb.toString(); | |
} | |
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Concepts | |
------------------------- | |
There are three methods- | |
1. initialize() - called once, at first. The goal of this method is to determine the return type | |
from the arguments. The user can also throw an Exception to signal that bad types are being sent | |
to the method. The returnOIResolver is a built-in class that determines the return type by finding | |
the type of non-null variables and using that type. The ObjectInspector is used to the transform | |
raw records into objects that Hive can access. The initialize() method is passed an | |
ObjectInspector for each argument | |
2. evaluate() - where the logic for the function should be written. | |
The evaluate method has access to the values passed to the method stored in an array of DeferredObject | |
values. The returnOIResolver created in the initialize method is used to get values from the | |
DeferredObjects. | |
3. getDisplayString() - The final method to override is getDisplayString(), is used inside the Hadoop | |
tasks to display debugging information when the function is being used. | |
Annotations: | |
@UDFType(deterministic = true) annotation: Indicates that the UDF returns the same value any time its | |
called | |
@Description(...) annotation: Includes information that is displayed when you do a describe on the UDF |
4: Expected results
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
**************************** | |
Expected results | |
**************************** | |
Query: | |
select deptNo,NVL2(deptName,deptName,'Procrastrination') from departments_UDFTest; | |
The null in the department name for department d008, should be returned as "Procrastrination". | |
For the rest of the records, the query should return the data in Hive, as is. | |
5: Try out the UDF
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
******************************* | |
Testing the UDF | |
******************************* | |
1) Add jar | |
hive> add jar hiveProject/jars/NVL2GenericUDF.jar; | |
2) Create alias for the function | |
hive> CREATE TEMPORARY FUNCTION NVL2 | |
AS 'khanolkar.HiveUDFs.NVL2GenericUDF'; | |
3) Test the description provided | |
hive> DESCRIBE FUNCTION NVL2; | |
OK | |
NVL2(expr1,expr2,expr3) returns expr3, if expr1 is null, otherwise returns expr2; | |
4) Test if there are checks in place for number of parameters | |
hive> select deptNo,NVL2(deptName,deptName) from departments_UDFTest; | |
FAILED: SemanticException [Error 10015]: Line 1:14 Arguments length mismatch 'deptName': The operator 'NVL2' accepts 3 arguments. | |
hive> select deptNo,NVL2(deptName,deptName,123,1) from departments_UDFTest; | |
FAILED: SemanticException [Error 10015]: Line 1:14 Arguments length mismatch '1': The operator 'NVL2' accepts 3 arguments. | |
5) Results | |
hive> select deptNo,NVL2(deptName,deptName,'Procrastrination') from departments_UDFTest; | |
OK | |
d001 Marketing | |
d002 Finance | |
d003 Human Resources | |
d004 Production | |
d005 Development | |
d006 Quality Management | |
d007 Sales | |
d008 Procrastrination | |
d009 Customer Service |
3.0. Making the UDF permanently available when you launch the hive shell
There are several ways to make a custom UDF available when you launch the Hive shell, bypassing the need to execute the "add jar..." statement before using a custom UDF. I have listed a couple of them.
Option 1:
From "Programming Hive"
Your function may also be added permanently to Hive, however this requires a small modification to a Hive Java file and then rebuilding Hive.
Inside the Hive source code, a one-line change is required to theFunctionRegistry
class found atql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java. Then you rebuild Hive following the instructions that come with the source distribution.
While it is recommended that you redeploy the entire new build, only the hive-exec-*.jar, where \* is the version number, needs to be replaced.
Option 2:
Add it to the .hiverc file on each node from where hive queries will be run.
Check out my blog - http://hadooped.blogspot.com/2013/08/hive-hiverc-file.html
4.0. References
Apache documentation:
http://hive.apache.org/docs/r0.10.0/api/org/apache/hadoop/hive/ql/udf/generic/GenericUDF.html
https://cwiki.apache.org/confluence/display/Hive/OperatorsAndFunctions
A good article on creating a UDF that involves non-primitive types - link
Programming Hive - from O'Reilly
That's it for this blog. Do share any additional insights with me.
Cheers!
Great work. Thanks for sharing your work.
ReplyDeleteThanks.
ReplyDeleteSaved me a full week's work. Where on earth are you based?
ReplyDeleteGlad the post helped you, Joseph.
ReplyDeleteI am based out of Chicago, IL.
thank you so much for the post !!!
ReplyDeleteHi,
ReplyDeleteI am trying to execute the same above genric udf and facing "NULL POINTER EXCEPTION"
hive> select * from departments_udftest;
OK
d001 marketing
d002 finance
d003 hr
d004
d005
d006 testing
Time taken: 0.238 seconds, Fetched: 6 row(s)
hive> select deptno,nvl2(deptname,deptname,'test') from departments_udftest;
FAILED: NullPointerException null
Can you please let me know if i am missing something here?
I guess you have already figured this out yourself, but for all others who struggle with this as I just did today: the line 52 in the java code (returnOIResolver = new GenericUDF...) should be moved before the third check. Otherwise you use object returnOIResolver before it is initiated.
ReplyDeleteThis comment has been removed by the author.
ReplyDeletethakyou it vry nice blog for beginners
ReplyDeletehttps://www.emexotechnologies.com/courses/big-data-analytics-training/big-data-hadoop-training/
Good post. Keep sharing. You can the training from,
ReplyDeleteMachine Learning training in Pallikranai Chennai
Pytorch training in Pallikaranai chennai
Data science training in Pallikaranai
Python Training in Pallikaranai chennai
Deep learning with Pytorch training in Pallikaranai chennai
Bigdata training in Pallikaranai chennai
Mongodb Nosql training in Pallikaranai chennai
Spark with ML training in Pallikaranai chennai
Data science Python training in Pallikaranai
Bigdata Spark training in Pallikaranai chennai
Sql for data science training in Pallikaranai chennai
Sql for data analytics training in Pallikaranai chennai
Sql with ML training in Pallikaranai chennai
As reported by Stanford Medical, It is indeed the ONLY reason women in this country live 10 years more and weigh an average of 19 kilos less than we do.
ReplyDelete(Just so you know, it is not related to genetics or some secret exercise and EVERYTHING to around "how" they are eating.)
BTW, I said "HOW", and not "WHAT"...
Click this link to discover if this quick quiz can help you release your real weight loss possibility