What is the difference between INNER JOIN and LEFT SEMI JOIN in Hive?

Definitive guide on Spark join algorithms
March 12, 2021
How to get a specific version of a file from S3 using AWS CLI?
March 17, 2021
Definitive guide on Spark join algorithms
March 12, 2021
How to get a specific version of a file from S3 using AWS CLI?
March 17, 2021

Both INNER JOIN and LEFT SEMI JOIN return matching records between both tables with a subtle difference.

Let’s consider 2 tables – employee and employee_department_mapping with the below data.

employee (2 columns - e_id and e_name)
10, Tom
20, Jerry 
30, Emily
employee_department_mapping (2 columns - e_id and d_name)
10, IT
10, Sales
20, HR

INNER JOIN

Here we INNER JOIN the 2 tables on e_id.

SELECT e.e_id, e.e_name, d.d_name FROM 
employee e INNER JOIN employee_department_mapping d
on e.e_id = d.e_id

INNER JOIN repeats the matching record from the left hand side table multiple times for each matching record in the right hand side.

10, Tom, IT
10, Tom, Sales
20, Jerry, HR

LEFT SEMI JOIN

Here we LEFT SEMI JOIN the 2 tables on e_id.

SELECT e.e_id, e.e_name, d.d_name FROM 
employee e LEFT SEMI JOIN employee_department_mapping d
on e.e_id = d.e_id

With LEFT SEMI JOIN, we get only the first matching record in the left hand side table in the output. As soon as a match is hit, the join will move to the next e_id.

10, Tom, IT
20, Jerry, HR

LEFT SEMI JOIN is similar to the EXISTS query below

SELECT name FROM employee e 
WHERE EXISTS (SELECT * FROM employee_department_mapping d WHERE (e.e_id=b.d_id))

When to use INNER JOIN and when to use LEFT SEMI JOIN

Use INNER JOIN if you want to repeat the matching record from the left hand side table multiple times for each matching record in the right hand side.

Use LEFT SEMI JOIN if you want to list the matching record from the left hand side table only once for each matching record in the right hand side.

LEFT SEMI JOIN is better performant when compared to the INNER JOIN.

Big Data In Real World
Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

Comments are closed.

What is the difference between INNER JOIN and LEFT SEMI JOIN in Hive?
This website uses cookies to improve your experience. By using this website you agree to our Data Protection Policy.

Hadoop In Real World is now Big Data In Real World!

X