How to insert into a Hive table when the table has columns with complex types like Arrays?
January 26, 2022How to find out if a directory in HDFS is empty or not?
February 9, 2022Both get_json_object and json_tuple functions in Hive are meant to work with JSON data in Hive.
Let’s create a table with some JSON data to work with. We have created a table named hirw_courses and loaded the below JSON text into the table.
CREATE TABLE IF NOT EXISTS hirw_courses (course string) STORED AS TEXTFILE; {"course": {"course1": {"name": "Hadoop Developer In Real World","type": "online"}, "course2": {"name": "Spark Developer In Real World", "type": "online"}},"email": "info@hadoopinrealworld.com","owner": "Big Data"}
When the above JSON string is formatted it looks like below.
{ "course": { "course1": { "name": "Hadoop Developer In Real World", "type": "online" }, "course2": { "name": "Spark Developer In Real World", "type": "online" } }, "email": "info@hadoopinrealworld.com", "owner": "Big Data" }
get_json_obect
get_json_object() is a UDF and can take a JSON string as input and you can pass in the key that you want to be extracted from the JSON.
Here we are selecting 2 values from the JSON – owner and course. And we ended up calling the get_json_object twice.
SELECT get_json_object(course, '$.owner'), get_json_object(course, '$.course') FROM hirw_courses; +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+ | _c0 | _c1 | +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+ | Big Data | {"course1":{"name":"Hadoop Developer In Real World","type":"online"},"course2":{"name":"Spark Developer In Real World","type":"online"}} | +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+ 1 row selected (0.125 seconds)
json_tuple
json_tuple is a User Defined Table generating Function (UDTF). <<Check out the post, if you want know the difference between UDF, UDAF and UDTFs)
With json_tuple you can pass in the json string followed by a list of keys for which you want to extract the values from JSON.
Since json_tuple is an UDTF, you can also use it with LATERAL VIEW which can be quite powerful.
SELECT json_tuple(course, 'owner', 'course') FROM hirw_courses; +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+ | c0 | c1 | +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+ | Big Data | {"course1":{"name":"Hadoop Developer In Real World","type":"online"},"course2":{"name":"Spark Developer In Real World","type":"online"}} | +-----------+-------------------------------------------------------------------------------------------------------------------------------------------+--+ 1 row selected (0.177 seconds)
get_json_object vs. json_tuple – which is better?
With get_json_object, if we are trying to extract 5 values from a JSON string we would have to call the function 5 times and this means the same JSON data is parsed through 5 times.
With json_tuple, we parse the JSON data only once even when we need to parse or extract 5 values from JSON.
So always use json_tuple in place of get_json_object for better performance.