
Shoehorn

How to create a Hive UDF in Scala


Abstract

This article focuses on creating a custom Hive UDF in the Scala programming language. IntelliJ IDEA 2016 was used to create the project and artifacts. Creation and testing of the UDF were performed on the Hortonworks Sandbox 2.4 running in Oracle VirtualBox. The full source code for the project can be found here.

Create project

Using IntelliJ IDEA, create a new project with the following configuration.

Create package

Once the project is created, add a new package under /shoehorn/src/ named: udf

Set dependencies

Edit your project structure and add the hive-exec-1.2.1.jar file to the module dependencies.
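If you build with sbt instead of IntelliJ's module settings, the equivalent dependency can be declared roughly as follows. This is a sketch, not part of the original project: the Maven coordinates are inferred from the jar name, the Scala version is an assumption, and marking the dependency "provided" keeps it out of the built jar since the cluster supplies Hive at runtime.

```scala
// build.sbt — hypothetical sbt equivalent of adding hive-exec-1.2.1.jar
// to the module dependencies. "provided" keeps it out of the built jar.
name := "shoehorn"
scalaVersion := "2.11.8" // assumption; use whatever your Sandbox supports
libraryDependencies += "org.apache.hive" % "hive-exec" % "1.2.1" % "provided"
```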

Create Artifact

Edit your project structure and add an artifact of type JAR > From modules with dependencies. For this example, I am adding the artifact to the shoehorn module.

Create Scala class

The first thing we need to do is create a Scala class that extends org.apache.hadoop.hive.ql.exec.UDF. For this example, the class name is ScalaUDF.

package udf

import org.apache.hadoop.hive.ql.exec.UDF

class ScalaUDF extends UDF {
}

Define function

Now we add our function definition inside the ScalaUDF class. For this example, I'm creating a simple function that takes an input column of string type and returns the length of that string. This function is for demonstration purposes only, as Hive's built-in length() function already provides the same functionality.

package udf

import org.apache.hadoop.hive.ql.exec.UDF

class ScalaUDF extends UDF {
  // Return the length of the input string. Hive passes SQL NULLs through
  // as Java nulls, so return a boxed Integer and pass NULL through rather
  // than throwing a NullPointerException on a NULL column value.
  def evaluate(str: String): Integer = {
    if (str == null) null else str.length
  }
}
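Before deploying, the per-row logic can be sanity-checked locally with plain Scala, no Hive classes required. The snippet below is a stand-alone sketch that mirrors the length computation the UDF applies to each input value:

```scala
// A tiny stand-alone check of the length logic the UDF applies per row.
object LengthCheck {
  def evaluate(str: String): Int = str.length

  def main(args: Array[String]): Unit = {
    println(evaluate("5553947406")) // prints 10
    println(evaluate("PHONE_NUM"))  // prints 9
  }
}
```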

Create artifact

Using IntelliJ IDEA, select Build > "Make Project" from the menu. Next, select Build > "Build Artifacts...". This will create the /shoehorn/out/artifacts/shoehorn_jar/shoehorn.jar file.

Create Hive UDF

Upload the shoehorn.jar file to HDFS. You may need to change the file permissions depending on which user will be executing Hive commands. For this example, I've uploaded the file to my local Hortonworks Sandbox in the location: hdfs:///jars/shoehorn.jar
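The upload can be scripted with the hdfs dfs CLI. The target path matches the example above; the local artifact path and permission bits are assumptions to adjust for your environment:

```shell
# Create the target directory and upload the built jar (overwrite if present).
hdfs dfs -mkdir -p /jars
hdfs dfs -put -f out/artifacts/shoehorn_jar/shoehorn.jar /jars/shoehorn.jar
# World-readable so whichever user runs Hive can load it.
hdfs dfs -chmod 644 /jars/shoehorn.jar
```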

In Hive, run the following command to register a new UDF. Note: this can be done in the Hive view in Ambari or through the Hive CLI.

create function getScalaLength as 'udf.ScalaUDF' using jar 'hdfs:///jars/shoehorn.jar';
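While iterating on the jar, it can be convenient to register a session-scoped function instead, which avoids touching the permanent function catalog (same class name and jar path as above):

```sql
add jar hdfs:///jars/shoehorn.jar;
create temporary function getScalaLength as 'udf.ScalaUDF';
```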

Finally, we can test our UDF using the following HQL in Hive.

select phone_number, getScalaLength(phone_number) from xademo.customer_details limit 5;

The result set returned (note that the first row, PHONE_NUM, is the column header stored as data in the sample table, hence its length of 9):

PHONE_NUM   9
5553947406  10
7622112093  10
5092111043  10
9392254909  10
Time taken: 6.38 seconds, Fetched: 5 row(s)