Definitive guide on Spark join algorithms
March 12, 2021How to get a specific version of a file from S3 using AWS CLI?
March 17, 2021Both INNER JOIN and LEFT SEMI JOIN return matching records between both tables with a subtle difference.
Let’s consider 2 tables – employee and employee_department_mapping with the below data.
employee (2 columns - e_id and e_name) 10, Tom 20, Jerry 30, Emily
employee_department_mapping (2 columns - e_id and d_name) 10, IT 10, Sales 20, HR
INNER JOIN
Here we INNER JOIN the 2 tables on e_id.
SELECT e.e_id, e.e_name, d.d_name FROM employee e INNER JOIN employee_department_mapping d on e.e_id = d.e_id
INNER JOIN repeats the matching record from the left hand side table multiple times for each matching record in the right hand side.
10, Tom, IT 10, Tom, Sales 20, Jerry, HR
LEFT SEMI JOIN
Here we LEFT SEMI JOIN the 2 tables on e_id.
SELECT e.e_id, e.e_name, d.d_name FROM employee e LEFT SEMI JOIN employee_department_mapping d on e.e_id = d.e_id
With LEFT SEMI JOIN, we get only the first matching record in the left hand side table in the output. As soon as a match is hit, the join will move to the next e_id.
10, Tom, IT 20, Jerry, HR
LEFT SEMI JOIN is similar to the EXISTS query below
SELECT name FROM employee e WHERE EXISTS (SELECT * FROM employee_department_mapping d WHERE (e.e_id=b.d_id))
When to use INNER JOIN and when to use LEFT SEMI JOIN
Use INNER JOIN if you want to repeat the matching record from the left hand side table multiple times for each matching record in the right hand side.
Use LEFT SEMI JOIN if you want to list the matching record from the left hand side table only once for each matching record in the right hand side.
LEFT SEMI JOIN is better performant when compared to the INNER JOIN.