Abstract
This article shows how to create a custom Hive UDF in the Scala programming language. IntelliJ IDEA 2016 was used to create the project and artifacts. The UDF was created and tested on the Hortonworks Sandbox 2.4 running in Oracle VirtualBox. The full source code for the project can be found here.
Create project
Using IntelliJ IDEA, create a new project with the following configuration.
- Project type: Scala
- Project name: shoehorn
- Project SDK: 1.8 (Java version 1.8)
- Scala SDK: scala-sdk-2.11.8
Create package
Once the project is created, add a new package under /shoehorn/src/ named: udf
Set dependencies
Edit your project structure and add the hive-exec-1.2.1.jar file to the module dependencies.
Configure artifact
Edit your project structure and add an artifact of type JAR > "From modules with dependencies". For this example, I am adding the artifact to the shoehorn module.
Create Scala class
The first thing we need to do is create a Scala class that extends the org.apache.hadoop.hive.ql.exec.UDF class. For this example, the class name is ScalaUDF.
package udf
import org.apache.hadoop.hive.ql.exec.UDF
class ScalaUDF extends UDF {
}
Define function
Now we add our function definition inside the ScalaUDF class. Hive looks up a method named evaluate on the class, so the function must use that name. For this example, I'm creating a simple function that takes an input column of string type and returns the length of that string. This function is for demonstration purposes only, as Hive's built-in length function already provides the same functionality.
package udf
import org.apache.hadoop.hive.ql.exec.UDF
class ScalaUDF extends UDF {
  def evaluate(str: String): Int = {
    str.length()
  }
}
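One thing the function above does not cover is a NULL column value, which Hive passes to evaluate as null, so str.length() would throw a NullPointerException. The following is a minimal sketch of a null-safe variant of the same logic. It assumes a boxed java.lang.Integer return type is acceptable to Hive, and it leaves out the UDF base class so that it runs without hive-exec on the classpath. The object name LengthLogic is hypothetical and not part of the original project.

```scala
// Standalone sketch: the Hive UDF base class is omitted so this runs without hive-exec.
// LengthLogic is a hypothetical name used only for illustration.
object LengthLogic {
  // A boxed Integer is returned so that a null input can map to a null result
  // instead of throwing a NullPointerException.
  def evaluate(str: String): java.lang.Integer =
    if (str == null) null else Integer.valueOf(str.length)

  def main(args: Array[String]): Unit = {
    println(evaluate("5553947406")) // prints 10
    println(evaluate(null))         // prints null
  }
}
```

To use this in the real UDF, the body of evaluate would move into the ScalaUDF class in place of the Int-returning version.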
Build artifact
In IntelliJ IDEA, select Build > "Make Project" from the main menu. Next, select Build > "Build Artifacts...". This will create the /shoehorn/out/artifacts/shoehorn_jar/shoehorn.jar file.
Create Hive UDF
Upload the shoehorn.jar file to HDFS. You may need to change the file permissions depending on which user will be executing Hive commands. For this example, I've uploaded the file to my local Hortonworks Sandbox in the location: hdfs:///jars/shoehorn.jar
In Hive, run the following command to register a new UDF. Note: This can be done in the Hive view in Ambari or through the Hive CLI.
create function getScalaLength as 'udf.ScalaUDF' using jar 'hdfs:///jars/shoehorn.jar';
Finally, we can test our UDF using the following HQL in Hive.
select phone_number, getScalaLength(phone_number) from xademo.customer_details limit 5;
The result set returned:
PHONE_NUM   9
5553947406  10
7622112093  10
5092111043  10
9392254909  10
Time taken: 6.38 seconds, Fetched: 5 row(s)
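As a quick sanity check on the lengths above, the same values can be run through plain Scala outside Hive. This is only a sketch: the object name ExpectedLengths is hypothetical, and reading the first row as a header line that was stored as data is an assumption based on its value.

```scala
// Hypothetical standalone check; it does not touch Hive or the xademo table.
object ExpectedLengths {
  // Values copied from the result set above.
  val sample: Seq[String] =
    Seq("PHONE_NUM", "5553947406", "7622112093", "5092111043", "9392254909")

  // The same computation the UDF performs on each non-null value.
  val lengths: Seq[Int] = sample.map(_.length)

  def main(args: Array[String]): Unit =
    sample.zip(lengths).foreach { case (value, len) => println(s"$value\t$len") }
}
```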