Monday, November 4, 2013

UDFs Part 1: Custom simple eval UDFs in Pig and Hive (NVL2)


1.0. What's in this blog?


A demonstration of creating a custom simple eval UDF to mimic NVL2 functionality from the DBMS world, in Pig and Hive. It includes sample data, the Java code for the UDF, the expected results, the commands to execute, and the output.

About NVL2:
NVL2 takes three parameters, which we will refer to as expr1, expr2 and expr3.
NVL2 lets you determine the value returned by a query based on whether a specified expression is null or not null. If expr1 is not null, then NVL2 returns expr2. If expr1 is null, then NVL2 returns expr3. For example, NVL2(null, "Busy bee", "Sloth") returns "Sloth", while NVL2("Anagha", "Busy bee", "Sloth") returns "Busy bee".
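
The rule is just a single null check. As a quick standalone illustration (a sketch of my own, separate from the UDF code later in this post), here it is in plain Java:

// NVL2Rule.java - standalone illustration of the NVL2 decision rule
public class NVL2Rule {

    // Returns expr2 when expr1 is non-null, otherwise expr3
    static String nvl2(String expr1, String expr2, String expr3) {
        return (expr1 != null) ? expr2 : expr3;
    }

    public static void main(String[] args) {
        System.out.println(nvl2("Anagha", "Busy bee", "Sloth")); // Busy bee
        System.out.println(nvl2(null, "Busy bee", "Sloth"));     // Sloth
    }
}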

2.0. NVL2 UDF in Hive


1: Create the test data file for a Hive external table
**************************
Input data
**************************
1. Execute locally on the node you are running the Hive client from
-----------------------------------------------------------------
Create the input file/data to use for the demo.
Since this gist is merely for instructional purposes, the dataset is small.
cd ~
mkdir hiveProject
cd hiveProject
vi Departments_UDFTest
Paste the following, ensuring the fields are delimited by tabs and each record ends with a newline.
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008
d009 Customer Service
2. Hadoop commands
--------------------
hadoop fs -mkdir hiveProject
hadoop fs -put hiveProject/Departments_UDFTest hiveProject

2: Create the Hive table
*********************************************
Setting up the Hive table
*********************************************
In hive shell....
a) Create table:
CREATE EXTERNAL TABLE IF NOT EXISTS departments_UDFTest
(
deptNo String,
deptName String
)
Row format delimited
fields terminated by '\t'
LOCATION '/user/akhanolk/hiveProject';
b) Quick test:
Select * from departments_UDFTest;
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 NULL
d009 Customer Service

3: Create the UDF in Java
//------------------------------------------------------------------
// Filename: NVL2.java
//------------------------------------------------------------------
package khanolkar.HiveUDFs;

import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;

// Custom UDF
// Name: NVL2
// Parameters: three Strings - expr1, expr2, expr3
// Returns: expr2 if expr1 is not null, otherwise expr3
public class NVL2 extends UDF {

    public String evaluate(String expr1, String expr2, String expr3)
            throws IOException {
        try {
            return (expr1 != null ? expr2 : expr3);
        } catch (Exception e) {
            // Cause task failure
            throw new IOException("Error with Hive UDF, NVL2!", e);
        }
    }
}
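Before wiring the class into Hive, it can be handy to sanity-check evaluate() locally. The snippet below is a minimal sketch of my own (the class name QuickCheck is hypothetical, not part of the original gists); it assumes NVL2.java above compiles with the Hive exec jar on the classpath.

// QuickCheck.java - local sanity check for the Hive NVL2 UDF (not part of the deployed jar)
package khanolkar.HiveUDFs;

import java.io.IOException;

public class QuickCheck {
    public static void main(String[] args) throws IOException {
        NVL2 nvl2 = new NVL2();
        // Non-null first argument: expect the second argument back
        System.out.println(nvl2.evaluate("Sales", "Sales", "Procrastination")); // Sales
        // Null first argument: expect the third argument back
        System.out.println(nvl2.evaluate(null, "ignored", "Procrastination"));  // Procrastination
    }
}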

4: Expected results
****************************
Expected results
****************************
Query:
select deptNo,NVL2(deptName,deptName,'Procrastination') from departments_UDFTest;
The null department name for department d008 should be returned as "Procrastination".
For the rest of the records, the query should return the data in Hive as is.

5: Test the UDF
*******************************
Testing the UDF
*******************************
First, compile NVL2.java and package it into hiveProject/jars/NVL2.jar. Then, in the hive shell:
hive> add jar hiveProject/jars/NVL2.jar;
hive> CREATE TEMPORARY FUNCTION NVL2
AS 'khanolkar.HiveUDFs.NVL2';
hive> select deptNo,NVL2(deptName,deptName) from departments_UDFTest;
FAILED: SemanticException [Error 10014]: Line 1:14 Wrong arguments 'deptName': No matching method for class khanolkar.HiveUDFs.NVL2 with (string, string). Possible choices: _FUNC_(string, string, string)
hive> select deptNo,NVL2(deptName,deptName,'Procrastination') from departments_UDFTest;
OK
d001 Marketing
d002 Finance
d003 Human Resources
d004 Production
d005 Development
d006 Quality Management
d007 Sales
d008 Procrastination
d009 Customer Service

3.0. NVL2 UDF in Pig


We will reuse data from section 2.
1: Create the UDF in Java
package khanolkar.pigUDFs;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Custom UDF
// Name: NVL2
// Parameters: Tuple with three Strings
// Purpose: Facilitates handling nulls + replacing non-null values
//          If the first parameter is null, returns the third parameter,
//          otherwise returns the second parameter
//          E.g. NVL2(null,"Busy bee","Sloth") = "Sloth"
//          E.g. NVL2("Anagha","Busy bee","Sloth") = "Busy bee"
// Returns: Null if tuple is empty
//          Null if the three input parameters are not in the tuple
//          Otherwise, result of applying NVL2 logic
public class NVL2 extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            if (input.size() == 3) {
                String expr1 = (String) input.get(0);
                String expr2 = (String) input.get(1);
                String expr3 = (String) input.get(2);
                return (expr1 != null ? expr2 : expr3);
            } else {
                return null;
            }
        } catch (Exception e) {
            // Cause task failure
            throw new IOException("Error with UDF, NVL2!", e);
        }
    }
}
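As with the Hive version, exec() can be exercised locally before registering the jar. The sketch below is my own addition (the class name PigNVL2QuickCheck is hypothetical); it uses Pig's TupleFactory to build the input tuple and assumes the Pig jar is on the classpath.

// PigNVL2QuickCheck.java - local sanity check for the Pig NVL2 UDF (not part of the deployed jar)
package khanolkar.pigUDFs;

import java.io.IOException;

import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class PigNVL2QuickCheck {
    public static void main(String[] args) throws IOException {
        NVL2 nvl2 = new NVL2();
        TupleFactory tf = TupleFactory.getInstance();

        // Non-null first field: expect the second field back
        Tuple t1 = tf.newTuple(3);
        t1.set(0, "Sales");
        t1.set(1, "Sales");
        t1.set(2, "Procrastination");
        System.out.println(nvl2.exec(t1)); // Sales

        // Null first field: expect the third field back
        Tuple t2 = tf.newTuple(3);
        t2.set(0, null);
        t2.set(1, "ignored");
        t2.set(2, "Procrastination");
        System.out.println(nvl2.exec(t2)); // Procrastination
    }
}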

2: Create the pig script
-- -------------------------------------------------------------------------------------
-- Pig Script
-- NVL2UDFDemo.pig
-- -------------------------------------------------------------------------------------
register NVL2.jar;
define NVL2 khanolkar.pigUDFs.NVL2;
rawDS = load 'departments' using PigStorage() as (deptNo:chararray, deptName:chararray);
transformedDS = foreach rawDS generate $0, NVL2($1,$1,'Procrastination');
dump transformedDS;

3: Test the UDF
[Adjust the path of the data file in the Pig script between local and HDFS locations as needed; better still, make it parameterized]
#---------------------------
# Command to test
#---------------------------
On the cluster
$ pig pigProject/evalFunc/NVL2/NVL2UDFDemo.pig
Locally
$ pig -x local pigProject/evalFunc/NVL2/NVL2UDFDemo.pig

4: Results
#---------------------------
# Output data
#---------------------------
(d001,Marketing)
(d002,Finance)
(d003,Human Resources)
(d004,Production)
(d005,Development)
(d006,Quality Management)
(d007,Sales)
(d008,Procrastination)
(d009,Customer Service)



Do share any additional insights/comments.
Cheers!
