Hive XML Options

I have been looking into the XML options for Hive metastore tables for the past couple of days and have a few thoughts to share on what’s possible.

Hive

Option 1: Custom SerDe

The XML parsing options are quite similar to what we can achieve in Spark and MapReduce, largely because the SerDe itself is written in Java. Here's an example of a table that we can create:

CREATE EXTERNAL TABLE TestTableXML (content STRING, link_id STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
	"column.xpath.content"="/",
	"column.xpath.link_id"="//*[local-name()='link_id']/text()"
)
STORED AS
	INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
	OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION "/user/shannonh/test-xml"
TBLPROPERTIES (
	"xmlinput.start"="<body ",
	"xmlinput.end"="</body>"
);
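For reference, a record that the definition above would parse looks something like this (a made-up sample; the real documents are whatever <body> fragments the xmlinput.start/end properties delimit):

```xml
<!-- Hypothetical input record: xmlinput.start/end split the file into
     <body> elements, content captures the whole fragment, and link_id
     is extracted by local name. -->
<body xmlns="http://www.shannonholgate.com/not/real">
  <link_id>abc-123</link_id>
</body>
```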

The parser used is the open-source Hive XML SerDe (com.ibm.spss.hive.serde2.xml).

Instead of using the XPath UDF we are using an XML SerDe, so the tables can be queried through standard SQL. However, this diminishes the power of nested XML and takes a lot of fiddling to get the table definition right.

Option 2: Hive XPath UDF

The other option would be to query the XML blob using the XPath UDF like so:

SELECT
xpath(content, "//*[local-name()='link_id']/text()")
FROM TestTableXML;
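Note that Hive's xpath UDF returns an array<string> of every match; when only the first match is wanted, the sibling xpath_string UDF returns a plain string instead:

```sql
-- xpath_string() yields the first match as a string rather than an array
SELECT xpath_string(content, "//*[local-name()='link_id']/text()")
FROM TestTableXML;
```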

Caveats

There are drawbacks in both instances when we need to deal with XML namespaces. In this example we have badly formatted XML documents with elements such as this:

<Message xmlns="http://www.shannonholgate.com/not/real">

The namespace here should be a real one, but it is never actually defined, so a namespace-aware XPath query won't pick the elements up easily. This is why we are using the local-name() function in Option 1:

"column.xpath.link_id"="//*[local-name()='link_id']/text()"

Namespaces are not supported at all by the Hive XPath UDF, so a custom UDF would need to be written to handle them.
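The effect is easy to reproduce outside Hive. A minimal Python sketch using the standard library's ElementTree (whose `{*}` wildcard, available from Python 3.8, plays the same role as local-name()):

```python
import xml.etree.ElementTree as ET

# Hypothetical document mirroring the example above: a default
# namespace is declared, so every element is namespace-qualified.
doc = """<Message xmlns="http://www.shannonholgate.com/not/real">
  <link_id>abc-123</link_id>
</Message>"""

root = ET.fromstring(doc)

# A plain element name misses, because the element's qualified name
# is really {http://www.shannonholgate.com/not/real}link_id.
print(root.find("link_id"))          # None

# Matching on the local name only (like local-name() in XPath) works.
print(root.find("{*}link_id").text)  # abc-123
```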

Impala

Impala does not support custom SerDes, so Option 1 isn't possible.

It should, however, be possible to use the Hive XPath UDF from Option 2. You just need to copy hive-exec.jar into HDFS, then create a custom function:

CREATE FUNCTION xpath(string, string)
RETURNS string
LOCATION '/user/impala/hive-exec.jar'
SYMBOL='org.apache.hadoop.hive.ql.udf.xml.UDFXPathString';
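Once registered, the function would be called just like the Hive built-in (hypothetical usage, assuming the same TestTableXML table is visible to Impala):

```sql
-- UDFXPathString returns a single string rather than an array
SELECT xpath(content, "//*[local-name()='link_id']/text()")
FROM TestTableXML;
```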

This is a nice feature, as Impala itself is written in C++ yet can still load Java UDFs. Sadly I have not been able to get this working, and the debug and error messages are not very descriptive:

WARNINGS: ImpalaRuntimeException: Unable to call create UDF instance.
CAUSED BY: InvocationTargetException: null
CAUSED BY: NullPointerException: null

The UDFXPathString class does not violate any of the restrictions dictated by Impala, as it returns only the first result, not an array. I will look into it further when I get the time; it could simply be a pesky permissions error, which causes all kinds of issues in the Hadoop world.

Summary

In conclusion, XML is workable in Hive but quite restricted without writing a lot of scripts to create tables, or custom UDFs to handle it.
