Ayadi Tahar | Master the Basics of Hbase CRUD Operations and Data Manipulations

Master the Basics of Hbase CRUD Operations and Data Manipulations

Publish Date: 2022-09-14


In columnar Databases, a table is a collection of rows. but what makes a difference is that the row is a key value pair where the key is the primary identifier for the record. while the value is one or more of column families, which is collection of columns that contain one or more columns or cells.

in this second part of our article related to HBase, we will demonstrate some HBase operations that cover create, read, insert, update and delete of data using HBase shell.

if you are unfamiliar with HBase , check out my previous article Columnar Databases: HBase Data Overview to get fast understanding and an overview of data models in HBase.

Crud Operations

lets start with namespaces level.

Create Namesapces

to get list of namespaces ( or databases), just type list_namespace in your hbase shell :

hbase(main):001:0> list_namespace

NAMESPACE
default
hbase
2 row(s) in 0.3970 seconds

if you want create a namespace, use create_namespace command followed by a namespace name in single quotes :


hbase(main):002:0> create_namespace 'class1'
0 row(s) in 0.1640 seconds

check and list namespaces again t see if it succeeds :


hbase(main):003:0> list_namespace
NAMESPACE
class1
default
hbase
3 row(s) in 0.0460 seconds

Create and list tables

as this article is a continuous of previous article , we will use the same table presented there. and for the record, this is a table we are using through the rest of our demo :

info grade
name gender age math physics science
20221 ahmed male 22 17 18 16
20222 sara female 20 14 16 15
20223 omar male 25 18 17 16
20224 layla female 18 15 14 13

to create a table Use the create command . You must specify the table name and the ColumnFamily(s) name. Use the next syntax to do just that:


hbase(main):004:0> create 'namespace:table', 'column_family'

if you don’t specify namespace it will use the default namespace. For our demo, lets create a students table in with 2 column families info and grade :


hbase(main):005:0> create 'students' , 'info', 'grade'
0 row(s) in 1.2520 seconds

=> Hbase::Table - students
List Information About your Table

you can use 'list' command to confirm that our students table is created successfully:


hbase(main):006:0> list
TABLE
students
1 row(s) in 0.0150 seconds

=> ["students"]

Now use the describe (you can use shortened version of it: desc ) command to see table details, including default configurations values :


hbase(main):007:0> describe 'students'
Table students is ENABLED
students
COLUMN FAMILIES DESCRIPTION
{NAME => 'grade', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.1510 seconds

Insert data into a Table

now that we created our table, it is time to put some data to work with, you can use many approaches especially in bulk and continues insert, but for our case we will simply use shell as our data is relatively small.

To put data into your table, use the put command with this syntax:


hbase(main):000:0> put 'table_name', 'row_id', 'column_family:column', 'value'

the syntax seems confusing , and to make it clear, let us in the next command insert three values, one at a time, all corresponds to one row.

The first insert is 20221, column info:name, with a value of ahmed . Columns in HBase are comprised of a column family prefix, info in our example, followed by a colon and then a column qualifier suffix, name in our case:


hbase(main):026:0> put 'students','20221', 'info:name','ahmed'
0 row(s) in 0.2950 seconds

hbase(main):027:0> put 'students','20221', 'info:gender','male'
0 row(s) in 0.0120 seconds

hbase(main):028:0> put 'students','20221', 'info:age','22'
0 row(s) in 0.0160 seconds

in the same way, you can add more cell values for grade column family to the same student 20221:


hbase(main):029:0> put 'students','20221', 'grade:math','17'
0 row(s) in 0.0280 seconds

hbase(main):030:0> put 'students','20221', 'grade:physics','18'
0 row(s) in 0.0170 seconds

hbase(main):031:0> put 'students','20221', 'grade:science','16'
0 row(s) in 0.0220 seconds

Read and Select Data from table

if you want to select the data we just insert, we can use 'scan' command:


hbase(main):033:0> scan 'students'  # to show table data
ROW                          COLUMN+CELL
20221                       column=grade:math, timestamp=1662386270882, value=17
20221                       column=grade:physics, timestamp=1662386270967, value=18
20221                       column=grade:science, timestamp=1662386271032, value=16
20221                       column=info:age, timestamp=1662385287949, value=22
20221                       column=info:gender, timestamp=1662385287894, value=male
20221                       column=info:name, timestamp=1662385287819, value=ahmed
1 row(s) in 0.0710 seconds

the data returned shows the values correspond to each column field for the row 20221 with timestamp as well.

Now lets add more rows to get some meaningful queries. Here we will add 3 more students with 2 column families:


put 'students','20222', 'info:name','sara'
put 'students','20222', 'info:gender','female'
put 'students','20222', 'info:age','20'
put 'students','20223', 'info:name','omar'
put 'students','20223', 'info:gender','male'
put 'students','20223', 'info:age','25'
put 'students','20224', 'info:name','layla'
put 'students','20224', 'info:gender','female'
put 'students','20224', 'info:age','18'

put 'students','20222', 'grade:math','14'
put 'students','20222', 'grade:physics','16'
put 'students','20222', 'grade:science','15'
put 'students','20223', 'grade:math','18'
put 'students','20223', 'grade:physics','17'
put 'students','20223', 'grade:science','16'
put 'students','20224', 'grade:math','15'
put 'students','20224', 'grade:physics','14'
put 'students','20224', 'grade:science','13'

you can count the number of records in the students table to verify:


hbase(main):056:0> count 'students'
4 row(s) in 0.0320 seconds

=> 4

Query Data

Querying Data in Scan Mode

use scan command again to see how data look like, but be careful and don’t use it against large tables as it can cause performance issue ,because all the data will get fetched.

Another way query table data, is to scan only the data in the specific column family, lets for a moment select info column family, we can use next syntax to achieve that:


hbase(main):058:0> scan 'students',{COLUMNS=>'info'}
ROW                          COLUMN+CELL
20221                       column=info:age, timestamp=1662385287949, value=22
20221                       column=info:gender, timestamp=1662385287894, value=male
20221                       column=info:name, timestamp=1662385287819, value=ahmed
20222                       column=info:age, timestamp=1662387271716, value=20
20222                       column=info:gender, timestamp=1662387271626, value=female
20222                       column=info:name, timestamp=1662387271561, value=sara
20223                       column=info:age, timestamp=1662387271976, value=25
20223                       column=info:gender, timestamp=1662387271888, value=male
20223                       column=info:name, timestamp=1662387271840, value=omar
20224                       column=info:age, timestamp=1662387272219, value=18
20224                       column=info:gender, timestamp=1662387272084, value=female
20224                       column=info:name, timestamp=1662387272023, value=layla
4 row(s) in 0.0860 seconds

we can also query only the math mark in the grade column family:


hbase(main):060:0> scan 'students',{COLUMNS=>'grade:math'}
ROW                          COLUMN+CELL
20221                       column=grade:math, timestamp=1662387264427, value=17
20222                       column=grade:math, timestamp=1662387264583, value=14
20223                       column=grade:math, timestamp=1662387264795, value=18
20224                       column=grade:math, timestamp=1662387264965, value=15
4 row(s) in 0.0580 seconds
Querying Data in Get Mode

In Get mode, data is queried based on the row key. For example if we want to get related information for student '20224' :


hbase(main):061:0> get 'students','20224'
COLUMN             CELL
grade:math        timestamp=1662387264965, value=15
grade:physics     timestamp=1662387265005, value=14
grade:science     timestamp=1662387265054, value=13
info:age          timestamp=1662387272219, value=18
info:gender       timestamp=1662387272084, value=female
info:name         timestamp=1662387272023, value=layla
6 row(s) in 0.0580 seconds

we can also fetch for specific column for student '20224':


hbase(main):062:0> get 'students','20224','info:name'
COLUMN             CELL
info:name         timestamp=1662387272023, value=layla
1 row(s) in 0.0200 seconds
Querying Data by Specified Criteria

we can also query data by specifying some criteria. Lets get only student names , but limit the result to only 2 rows:


hbase(main):064:0> scan 'students',{STARTROW=>'20221', 'LIMIT'=>2, COLUMNS=>'info:name'}
ROW                          COLUMN+CELL
20221                       column=info:name, timestamp=1662385287819, value=ahmed
20222                       column=info:name, timestamp=1662387271561, value=sara
2 row(s) in 0.0360 seconds

Updates and Querying Multiversion Data

HBase can store multiples versions for the same row. In our case we will put (update) the student '20221' name, to have multiples versions:


hbase(main):065:0> put 'students','20221','info:name','Ahmed Ali'
0 row(s) in 0.0160 seconds

hbase(main):066:0> put 'students','20221','info:name','Ahmed Sghir'
0 row(s) in 0.0130 seconds

hbase(main):067:0> put 'students','20221','info:name','Ahmed Wald Nabil'
0 row(s) in 0.0150 seconds

if you query the student '20221' name:


hbase(main):069:0> get 'students','20221','info:name'
COLUMN                       CELL
info:name                   timestamp=1662390375880, value=Ahmed Wald Nabil
1 row(s) in 0.0110 seconds

as you can see, you only get the latest update, the same behavior as in relational databases. however, this is not what we want to !? we want to get all the previous updates (called versions in HBase) as well and this is what HBase is god on it.

You can set VERSIONS to specify the number of versions (updates to be queried:


hbase(main):070:0> get 'students','20221',{'COLUMNS'=>'info', VERSIONS=>5}
COLUMN                       CELL
info:age                    timestamp=1662385287949, value=22
info:gender                 timestamp=1662385287894, value=male
info:name                   timestamp=1662390375880, value=Ahmed Wald Nabil
3 row(s) in 0.0190 seconds

The version is specified during the search, but the last record is still displayed. Although VERSIONS is added, only one record is returned after the get operation. This is because the default value of VERSIONS is 1 during table creation. To verify that execute the next command and check VERSIONS property:


hbase(main):071:0> desc 'students'
Table students is ENABLED
students
COLUMN FAMILIES DESCRIPTION
{NAME => 'grade', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0810 seconds

To view data of multiple versions, run the following statement to change the value of VERSIONS of the table ( or you can specify the value when creating the table as well):


hbase(main):072:0> alter 'students', {NAME=>'info', 'VERSIONS'=>5}
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 3.1770 seconds

Then, insert multiple data record updates to student '20221' names:


hbase(main):073:0> put 'students','20221','info:name','Ahmed Ali'
0 row(s) in 0.0230 seconds

hbase(main):074:0> put 'students','20221','info:name','Ahmed Ben Ali'
0 row(s) in 0.0180 seconds

hbase(main):075:0> put 'students','20221','info:name','Ahmed Wald Ali'
0 row(s) in 0.0240 seconds

now to get the last 5 names versions for student '20221', we just run get command as in the next command:


hbase(main):076:0> get 'students','20221',{'COLUMNS'=>'info', VERSIONS=>5}
COLUMN                       CELL
info:age                    timestamp=1662385287949, value=22
info:gender                 timestamp=1662385287894, value=male
info:name                   timestamp=1662391886964, value=Ahmed Wald Ali
info:name                   timestamp=1662391885641, value=Ahmed Ben Ali
info:name                   timestamp=1662391882289, value=Ahmed Ali
info:name                   timestamp=1662390375880, value=Ahmed Wald Nabil
info:name                   timestamp=1662385287819, value=ahmed
7 row(s) in 0.0350 seconds

Deleting Data records

if you want, you can delete specific single row field, for example lets check the age of student '20221':


hbase(main):078:0> get 'students','20221', 'info:age'
COLUMN                       CELL
info:age                    timestamp=1662385287949, value=22
1 row(s) in 0.0460 seconds

let’s delete this age value, that correspond to 'Ahmed' student:


hbase(main):079:0> delete 'students','20221','info:age'
0 row(s) in 0.0590 seconds

if we check again, the age value is gone, and it is no longer in the database:


hbase(main):080:0> get 'students','20221', 'info:age'
COLUMN                       CELL
0 row(s) in 0.0110 seconds

hbase(main):081:0> get 'students','20221'
COLUMN                       CELL
grade:math                  timestamp=1662387264427, value=17
grade:physics               timestamp=1662387264488, value=18
grade:science               timestamp=1662387264533, value=16
info:gender                 timestamp=1662385287894, value=male
info:name                   timestamp=1662391886964, value=Ahmed Wald Ali
5 row(s) in 0.0310 seconds

we can also delete all records of a specific row as well, using 'deleteall' command:


hbase(main):082:0> deleteall 'students','20221'
0 row(s) in 0.0150 seconds

Deleting Tables

If you want to delete a table or change its settings, as well as in some other situations, you need to disable the table first, using the disable command :


hbase(main):083:0> disable 'students'
0 row(s) in 2.4950 seconds

N.B: you can use 'enable' command in the same way if just change some settings.

Drop the table

To drop (delete) a table, use the drop command:


hbase(main):084:0> drop 'students'
0 row(s) in 1.2880 seconds
Drop a namespace

in the same way, you can drop a namesapace if you are no longer using it:

This command drop_namespace is used to drop the namespace . You can drop only empty namespace, so you must drop all the tables created in that namespace before attempting to drop it.

as our namespace 'class1' is empty, we can run the next command to do so:


hbase(main):085:0> drop_namespace 'class1'
0 row(s) in 1.3680 seconds

Conclusion

by the end of this demo, you now can use different commands to manipulate data in HBae, including create, put, get, scan, delete drop and many others.

Resources
HBase Official Guide