I have been looking into the XML options for Hive metastore tables for the past couple of days and have a few thoughts to share on what’s possible.
The XML parsing options are quite similar to what we can achieve in Spark and MapReduce simply because the SerDe is written in Java anyway. Here’s an example of a table that we can create:
CREATE EXTERNAL TABLE TestTableXML (content STRING, link_id STRING)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
    "column.xpath.content"="/",
    "column.xpath.link_id"="//*[local-name()='link_id']/text()"
)
STORED AS
    INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION "/user/shannonh/test-xml"
TBLPROPERTIES (
    "xmlinput.start"="<body ",
    "xmlinput.end"="</body>"
);
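To make the TBLPROPERTIES clearer: xmlinput.start and xmlinput.end tell the input format where each record begins and ends, so every span from the start tag through the end tag becomes one row. A rough, simplified sketch of that splitting behaviour (not the actual XmlInputFormat code, and the sample data is made up):

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordSplitter {

    // Simplified model of what the input format does: scan forward,
    // and emit every start-tag..end-tag span as one record.
    public static List<String> split(String data, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = data.indexOf(startTag, pos);
            if (start < 0) break;
            int end = data.indexOf(endTag, start);
            if (end < 0) break;
            end += endTag.length();
            records.add(data.substring(start, end));
            pos = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<doc><body id=\"1\">a</body><body id=\"2\">b</body></doc>";
        // prints <body id="1">a</body> then <body id="2">b</body>
        for (String record : split(data, "<body ", "</body>")) {
            System.out.println(record);
        }
    }
}
```

Note that the real input format also has to cope with records that straddle HDFS block boundaries, which this sketch ignores.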
The parser used is the XML-data-sources SerDe.
Instead of using the XPath UDF we are using an XML SerDe, so that the tables can be queried through standard SQL. However, this diminishes the power of nested XML and requires a lot of foolery to get the table definition right.
The other option would be to query the XML blob using the XPath UDF like so:
SELECT xpath(content, "//*[local-name()='link_id']/text()") FROM TestTableXML;
There are drawbacks in both instances when we need to use XML namespaces. In this example we have badly formatted XML documents: the elements reference a namespace that should be real, but it is never defined, so XPath won't pick the elements up easily. This is why we are using the local-name() function in Option 1.
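To illustrate why local-name() is needed, here is a small standalone example using the JDK's built-in XPath engine. The record below is hypothetical: it puts link_id inside a default namespace, which is enough to make a plain path expression miss it:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class LocalNameDemo {

    // Parse namespace-aware (so the problem is visible) and evaluate an XPath.
    public static String extract(String xml, String expr) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();
        return xp.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<body xmlns=\"urn:example\"><link_id>42</link_id></body>";
        // Unprefixed names only match elements with NO namespace, so this is empty:
        System.out.println(extract(xml, "//link_id/text()"));
        // local-name() matches regardless of namespace, so this prints 42:
        System.out.println(extract(xml, "//*[local-name()='link_id']/text()"));
    }
}
```

The same effect applies inside the SerDe's column.xpath.* expressions, which is why the table definition above uses the local-name() form.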
Namespaces are not supported at all in the Hive XPath UDF, so a custom UDF would need to be written.
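Such a custom UDF would essentially wrap namespace-aware XPath evaluation, which the JDK already supports via NamespaceContext. A sketch of the core logic it would contain (the class name, the "ex" prefix, and its binding are all invented for illustration; the real UDF would additionally extend Hive's UDF class):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class NamespaceXPath {

    // Resolve prefixes used in XPath expressions to namespace URIs.
    static NamespaceContext context(Map<String, String> bindings) {
        return new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return bindings.getOrDefault(prefix, "");
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        };
    }

    public static String evaluate(String xml, String expr,
                                  Map<String, String> bindings) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();
        xp.setNamespaceContext(context(bindings));
        return xp.evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<body xmlns=\"urn:example\"><link_id>42</link_id></body>";
        // Bind our invented "ex" prefix to the document's namespace; prints 42.
        System.out.println(evaluate(xml, "//ex:link_id/text()",
                Map.of("ex", "urn:example")));
    }
}
```

With prefix bindings in place the query can use proper qualified names instead of the local-name() workaround.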
Impala does not actually support custom SerDes, so Option 1 isn't possible.
In contrast, it should be possible to use the Hive XPath UDF in Option 2. You just need to copy the hive-exec.jar into HDFS, then create a custom function:
CREATE FUNCTION xpath(string, string) RETURNS string
LOCATION '/user/impala/hive-exec.jar'
SYMBOL='org.apache.hadoop.hive.ql.udf.xml.UDFXPathString';
This is a nice feature, as Impala is written in C++ yet can still call into Java UDFs. Sadly I have not been able to get this working, and the debug and error messages are not very descriptive:
WARNINGS: ImpalaRuntimeException: Unable to call create UDF instance.
CAUSED BY: InvocationTargetException: null
CAUSED BY: NullPointerException: null
The UDFXPathString class does not violate any of the restrictions dictated by Impala here, as it only returns the first result rather than an array. I will look into it further when I get the time, as it could simply be a pesky permissions error; those cause all kinds of issues in the Hadoop world.
In conclusion, XML is workable in Hive but quite restricted without writing a lot of scripts to create tables or custom UDFs to handle it.