How to pivot and unpivot a DataFrame in Spark?
March 23, 2022How to fix could not connect to the endpoint URL issue in S3?
April 6, 2022How to parse information from URL in Hive?
Hive offers 2 functions to work with URLS – parse_url and parse_url_tuple.
With both functions you can extract information like – PROTOCOL, HOST, PATH, QUERY, Query parameters etc.
Let’s see them in action.
parse_url
parse_url is a user defined function. It takes in the URL and the information that we are trying to extract.
We use the key words HOST and PROTOCOL to extract host and protocol respectively.
SELECT parse_url('https://hadoopinrealworld.com/pages?key1=value1&key2=value2', 'PROTOCOL'), parse_url('https://hadoopinrealworld.com/pages?key1=value1&key2=value2', 'HOST'); +--------+------------------------+--+ | _c0 | _c1 | +--------+------------------------+--+ | https | hadoopinrealworld.com | +--------+------------------------+--+ 1 row selected (0.183 seconds)
parse_url_tuple
parse_url_tuple() is a User Defined Table generating Function (UDTF). <<Check out the post, if you want know the difference between UDF, UDAF and UDTFs)
SELECT parse_url_tuple('https://hadoopinrealworld.com/pages?key1=value1&key2=value2', 'PROTOCOL', 'HOST', 'PATH', 'QUERY', 'QUERY:key1') 0: jdbc:hive2://ms2.hirw.com:2181,wk1.hirw.co> ; +--------+------------------------+---------+--------------------------+---------+--+ | c0 | c1 | c2 | c3 | c4 | +--------+------------------------+---------+--------------------------+---------+--+ | https | hadoopinrealworld.com | /pages | key1=value1&key2=value2 | value1 | +--------+------------------------+---------+--------------------------+---------+--+ 1 row selected (0.615 seconds)
parse_url vs. parse_url_tuple – which is better?
With parse_url, if we are trying to extract 5 values from the URL we would have to call the function 5 times and this means the same URL is parsed through 5 times.
With parse_url_tuple, we parse the URL only once even when we need to parse or extract 5 values from the URL.
So always use parse_url_tuple in place of parse_url for better performance.