Ayadi Tahar | External Vs Managed Tables in Hive

External Vs Managed Tables in Hive

Publish Date: 2022-06-21

If you just start learning Big Data technologies, You might not know that there is 2 main basic types of tables in Apache Hive. Knowing the difference between them and when to use one in place of other, can give you great results and impact your data management. that and more what we will cover in our article along with hands on demos.

1. differences between External and managed tables

Hive Managed Table

if you have been creating hive tables using generic SQL format, then you are probably creating managed table even if you don’t notice because is the default format used by hive .

managed tables has full control over data, that means if when you delete the tables, the data belongs to it (files and dataset ) also deleted and goes wit it.

The only case to use this kind of tables if hive is the only application who use this dataset in hdfs.

Hive External Table

in the other hand, external tables don't have full control over dataset files, that means a table has read access to hdfs data , and when you delete a table the original data stay non modified.

you use this type of tables when the dataset is shared among other apps(pig, mapreduce,…) and it’s the most used in compare to managed tables.

you must specify explicitly the type external by adding a word EXTERNAL in SQL syntax of table creation in order for the table to be external otherwise it will be managed by default.

a. Load the data into HDFS

the first step we have to do is to get the dataset, you can download the file from my repository:


$ wget https://github.com/ayaditahar/hive/blob/main/data/constituents.csv

the file is of type csv and contain informations about 500 constituents, we can add this file into hdfs by issuing these next commands:


0: jdbc:hive2://ubuntu21:10000>!sh hdfs dfs -mkdir /user/ahmed/data/constituents
0: jdbc:hive2://ubuntu21:10000>!sh hdfs dfs -put constituents.csv /user/ahmed/data/constituents
0: jdbc:hive2://ubuntu21:10000>!sh hdfs dfs -ls /user/ahmed/data/constituents

N.B: I am executing these commands directly inside Beeline shell, if you are executing it outside hive shell,it is fine, just omit exclamation mark ans “sh” ( !sh) in the beginning.

b. Create a database

after that , we have to create a database or choose an existing one in order to create table. Just run the next command to create and use a new one:


0: jdbc:hive2://ubuntu21:10000> CREATE DATABASE IF NOT EXISTS constituents_db;
No rows affected (1.795 seconds)
0: jdbc:hive2://ubuntu21:10000> USE constituents_db;
No rows affected (0.107 seconds)
0: jdbc:hive2://ubuntu21:10000> SHOW DATABASES;
+------------------+
|  database_name   |
+------------------+
| constituents_db  |
| default          |
+------------------+

2. Create a Managed Table

to create a managed table , just follow the general generic SQL sysntax, and the created table will be by dfault managed . For our case


0: jdbc:hive2://ubuntu21:10000> CREATE TABLE IF NOT EXISTS constituents_tb(
. . . . . . . . . . . . . >     symbol STRING,
. . . . . . . . . . . . . >     name STRING,
. . . . . . . . . . . . . >     sector STRING
. . . . . . . . . . . . . > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
No rows affected (0.777 seconds)

to get info about the table like: column names, type of table, it’s location and other infos; issue the next command (some output is omitted from the screen):


0: jdbc:hive2://ubuntu21:10000> DESCRIBE FORMATTED constituents_tb;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
|           col_name            |                     data_type                      |                      comment                       |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| symbol                        | string                                             |                                                    |
| name                          | string                                             |                                                    |
| sector                        | string                                             |                                                    |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | constituents_db                                    | NULL                                               |
| CreateTime:                   | Tue Jun 21 18:27:28 CET 2022                       | NULL                                               |
| Location:                     | hdfs://ubuntu21:9000/user/hive/warehouse/constituents_db.db/constituents_tb | NULL                                               |
| Table Type:                   | MANAGED_TABLE                                      | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+

as you can see the table is of type managed in property "Table Type".

a. populate a table with data:

if you try to query the table, you will see it is empty:


0: jdbc:hive2://ubuntu21:10000> select * from constituents_tb;
+-------------------------+-----------------------+-------------------------+
| constituents_tb.symbol  | constituents_tb.name  | constituents_tb.sector  |
+-------------------------+-----------------------+-------------------------+
+-------------------------+-----------------------+-------------------------+

you can add data to table by just running this next command:


0: jdbc:hive2://ubuntu21:10000> LOAD DATA INPATH '/user/ahmed/data/constituents'
. . . . . . . . . . . . . > INTO TABLE constituents_tb;

now, if you query a table again , you will get results (we limit to only 5 rows to fit in screen) :


0: jdbc:hive2://ubuntu21:10000> select * from constituents_tb limit 5;
+-------------------------+-----------------------+-------------------------+
| constituents_tb.symbol  | constituents_tb.name  | constituents_tb.sector  |
+-------------------------+-----------------------+-------------------------+
| Symbol                  | Name                  | Sector                  |
| MMM                     | 3M Company            | Industrials             |
| AOS                     | A.O. Smith Corp       | Industrials             |
| ABT                     | Abbott Laboratories   | Health Care             |
| ABBV                    | AbbVie Inc.           | Health Care             |
+-------------------------+-----------------------+-------------------------+

b. delete table

now let’s delete the table:


0: jdbc:hive2://ubuntu21:10000> DROP TABLE constituents_tb;
No rows affected (0.200 seconds)

if you check the table location mentioned earlier in table above(describe command ), you will notice that it is gone:


0: jdbc:hive2://ubuntu21:10000> !sh hdfs dfs -ls /user/hive/warehouse/constituents_db.db/

as well the original location of file itself:


0: jdbc:hive2://ubuntu21:10000> !sh hdfs dfs -ls /user/ahmed/data/constituents

both commands return nothing which means the dataset is no longer exists.

3. Create External Table

because data was deleted in previous step, we have to load it again into hdfs :


0: jdbc:hive2://ubuntu21:10000>!sh hdfs dfs -put constituents.csv /user/ahmed/data/constituents
0: jdbc:hive2://ubuntu21:10000> !sh hdfs dfs -ls /user/ahmed/data/constituents
Found 1 items
-rw-r--r--   1 ahmed supergroup      19182 2022-06-21 20:26 /user/ahmed/data/constituents/constituents.csv

to create an external table, we only have to add a word EXTERNAL after create, just like in our case:


0: jdbc:hive2://ubuntu21:10000> CREATE EXTERNAL TABLE IF NOT EXISTS constituents_tb(
. . . . . . . . . . . . . > symbol STRING,
. . . . . . . . . . . . . > name STRING,
. . . . . . . . . . . . . > sector STRING
. . . . . . . . . . . . . > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
. . . . . . . . . . . . . > TBLPROPERTIES ("skip.header.line.count"="1");
No rows affected (0.199 seconds)

if we look to detail about the table we just create:


0: jdbc:hive2://ubuntu21:10000> DESCRIBE FORMATTED constituents_tb;
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
|           col_name            |                     data_type                      |                      comment                       |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| symbol                        | string                                             |                                                    |
| name                          | string                                             |                                                    |
| sector                        | string                                             |                                                    |

| Database:                     | constituents_db                                    | NULL                                               |
| CreateTime:                   | Tue Jun 21 20:29:15 CET 2022                       | NULL                                               |
| Location:                     | hdfs://ubuntu21:9000/user/hive/warehouse/constituents_db.db/constituents_tb | NULL                                               |
| Table Type:                   | EXTERNAL_TABLE                                     | NULL                                               |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+

as you can see, because we didn't specify the location during table creation, the location by default will be the same if the table is the same. and you notice this time the table type is External as mentioned in the output.

a. LOAD DATA INTO TABLE

now let’s populate the table with some data, but before that we can check that the table is empty:


0: jdbc:hive2://ubuntu21:10000> select * from constituents_tb ;
+-------------------------+-----------------------+-------------------------+
| constituents_tb.symbol  | constituents_tb.name  | constituents_tb.sector  |
+-------------------------+-----------------------+-------------------------+
+-------------------------+-----------------------+-------------------------+
No rows selected (0.27 seconds)

you can load the previous file copied into HDFS into our table by using this syntax:


0: jdbc:hive2://ubuntu21:10000> LOAD DATA INPATH '/user/ahmed/data/constituents'
. . . . . . . . . . . . . > INTO TABLE constituents_tb;
No rows affected (0.429 seconds)

now query a table again:


0: jdbc:hive2://ubuntu21:10000> select * from constituents_tb limit 6;
+-------------------------+-----------------------+-------------------------+
| constituents_tb.symbol  | constituents_tb.name  | constituents_tb.sector  |
+-------------------------+-----------------------+-------------------------+
| MMM                     | 3M Company            | Industrials             |
| AOS                     | A.O. Smith Corp       | Industrials             |
| ABT                     | Abbott Laboratories   | Health Care             |
| ABBV                    | AbbVie Inc.           | Health Care             |
| ACN                     | Accenture plc         | Information Technology  |
| ATVI                    | Activision Blizzard   | Information Technology  |
+-------------------------+-----------------------+-------------------------+
6 rows selected (0.316 seconds)

b. delete table

now lets check the behavior of external table after deletion. Run this command to drop the table:


0: jdbc:hive2://ubuntu21:10000> DROP TABLE constituents_tb;
No rows affected (0.26 seconds)
0: jdbc:hive2://ubuntu21:10000> show tables;
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (0.101 seconds)

if you check if table location still exist or not, you will find that the dataset should still there:


0: jdbc:hive2://ubuntu21:10000> !sh hdfs dfs -ls /user/hive/warehouse/constituents_db.db/constituents_tb
Found 1 items
-rw-r--r--   1 ahmed supergroup      19182 2022-06-21 20:26 /user/hive/warehouse/constituents_db.db/constituents_tb/constituents.csv

but if you check the original location:


0: jdbc:hive2://ubuntu21:10000> !sh hdfs dfs -ls /user/ahmed/data/constituents

the command return nothing !, the reason of that isn’t related to table type whether if it was managed or external, but because when we populate the table with data usin LOAD command , which move the data instead of copying it.

4 - Clean the data

after we're done with the demos you can clean the state by remove the tables and the database:


0: jdbc:hive2://ubuntu21:10000> DROP TABLE IF EXISTS constituents_tb;
No rows affected (0.209 seconds)
0: jdbc:hive2://ubuntu21:10000> DROP DATABASE IF EXISTS constituents_db;
No rows affected (0.277 seconds)

CONCLUSION

by now, if you completed all the demo, you know the difference between managed and external table in Hive, the use case of each one, and how to create them step by step.

If you have any question or have a difficulties , just leave a comment below.

External Vs Managed Tables in Hive

1. differences between External and managed tables

Hive Managed Table

Hive External Table

a. Load the data into HDFS

b. Create a database

2. Create a Managed Table

a. populate a table with data:

b. delete table

3. Create External Table

a. LOAD DATA INTO TABLE

b. delete table

4 - Clean the data

CONCLUSION

Data Structures

Data Engineering

Algorithms

Data Science

Linux

Deploy Minio in Openshift

Run Ansible playbooks Using Red Hat Satellite

Controlling pod placement onto nodes in OpenShift

Create a storage class for NFS dynamic storage provisioning in OpenShift

Windows management with Ansible

Running Spark on Kubernetes with AKS

Data Lakehouse, The Best of Both Worlds

CAP Theorem, Does it still hold in modern days

Pandas API over Pyspark

External Vs Managed Tables in Hive

1. differences between External and managed tables

Hive Managed Table

Hive External Table

a. Load the data into HDFS

b. Create a database

2. Create a Managed Table

a. populate a table with data:

b. delete table

3. Create External Table

a. LOAD DATA INTO TABLE

b. delete table

4 - Clean the data

CONCLUSION

Leave comment:

Show comments: (0)