Hive stores metadata about its tables in a relational database such as Derby or MySQL, known as the metastore. The metadata includes the table name, the table structure, partition information, the location of the dataset, and so on.
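Note that Hive does not store the data behind its tables in this database. The data lives as files in HDFS or a similar storage system, and whether Hive manages the lifecycle of those files depends on the table type.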
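To see what the metastore holds for a given table, you can ask Hive to describe it. This is a minimal sketch; the table name `orders` is hypothetical and assumed to exist already.

```sql
-- Print the metadata Hive keeps for a table (hypothetical table: orders).
-- The output includes the column list, a "Location:" field pointing at the
-- directory that holds the data files, and a "Table Type:" field that reads
-- MANAGED_TABLE for internal tables and EXTERNAL_TABLE for external tables.
DESCRIBE FORMATTED orders;
```

The `Location:` value is a path on the storage system, which shows that the data itself sits outside the metastore database.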
Internal tables
All metadata for an internal table (also called a managed table) is managed by Hive.
When an internal table is dropped, Hive also deletes the data behind the table.
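The behavior of an internal table can be sketched as follows. The table name and columns are made up for illustration, and the exact storage defaults vary by Hive version and distribution.

```sql
-- Without the EXTERNAL keyword, Hive creates an internal (managed) table.
-- The data files go under Hive's warehouse directory unless a LOCATION is given.
CREATE TABLE IF NOT EXISTS staging_orders (  -- hypothetical table
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS PARQUET;

-- Dropping the table removes the metadata and the data files.
-- If HDFS trash is enabled, the files are moved to trash first;
-- adding PURGE skips the trash and deletes them immediately.
DROP TABLE IF EXISTS staging_orders;
```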
External tables
As with internal tables, all metadata for an external table is managed by Hive.
Unlike internal tables, when an external table is dropped, Hive does not delete the data behind the table.
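An external table could be declared roughly like this. The table name, columns, and the path `/data/orders` are assumptions chosen for illustration.

```sql
-- The EXTERNAL keyword tells Hive that it does not own the data files.
-- LOCATION points at a directory that already holds (or will hold) the data.
CREATE EXTERNAL TABLE IF NOT EXISTS orders (  -- hypothetical table
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS PARQUET
LOCATION '/data/orders';  -- hypothetical path

-- Dropping the table removes only the metadata from the metastore.
-- The files under /data/orders stay where they are.
DROP TABLE IF EXISTS orders;
```

Because the files survive the drop, running the same `CREATE EXTERNAL TABLE` statement again makes the data queryable once more.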
When to use an internal table and when to use an external table?
A good use case for an internal table is holding intermediate data in Hive. In that case, when you drop the table, you also want the data behind it to be dropped.
Internal tables also make sense when you drop and recreate tables frequently, because you do not want leftover data to keep accumulating.
In most cases, external tables make sense. In most real-world scenarios, your Hive table is fed by external processes such as Spark jobs and consumed by applications outside Hive. In such instances, Hive merely holds the metadata while the data is managed by processes outside Hive, so it makes sense to keep the data intact when the Hive table is dropped.
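As a sketch of that setup, suppose an external job writes Parquet files into date-partitioned directories such as `/data/events/event_date=2021-01-22/`. The table name, columns, and path here are hypothetical.

```sql
-- External table over a directory that an outside process writes to.
CREATE EXTERNAL TABLE IF NOT EXISTS events (  -- hypothetical table
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET
LOCATION '/data/events';  -- hypothetical path

-- Hive only knows about partitions registered in the metastore.
-- MSCK REPAIR TABLE scans the table location and registers any
-- event_date=... directories that the outside process has added.
MSCK REPAIR TABLE events;
```

Dropping `events` later would remove only the table definition and its partition entries, leaving the files for the other consumers.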
To keep it simple: when in doubt, create an external table.