Publish Date: 2022-09-14
In columnar Databases, a table is a collection of rows. but what makes a difference is that the row is a key value pair where the key is the primary identifier for the record. while the value is one or more of column families, which is collection of columns that contain one or more columns or cells.
in this second part of our article related to HBase, we will demonstrate some HBase operations that cover create, read, insert, update and delete of data using HBase shell.
if you are unfamiliar with HBase , check out my previous article Columnar Databases: HBase Data Overview to get fast understanding and an overview of data models in HBase.
lets start with namespaces level.
hbase(main):001:0> list_namespace
NAMESPACE
default
hbase
2 row(s) in 0.3970 seconds
if you want create a namespace, use create_namespace command followed by a namespace name in single quotes :
hbase(main):002:0> create_namespace 'class1'
0 row(s) in 0.1640 seconds
check and list namespaces again t see if it succeeds :
hbase(main):003:0> list_namespace
NAMESPACE
class1
default
hbase
3 row(s) in 0.0460 seconds
as this article is a continuous of previous article , we will use the same table presented there. and for the record, this is a table we are using through the rest of our demo :
info | grade | |||||
---|---|---|---|---|---|---|
name | gender | age | math | physics | science | |
20221 | ahmed | male | 22 | 17 | 18 | 16 |
20222 | sara | female | 20 | 14 | 16 | 15 |
20223 | omar | male | 25 | 18 | 17 | 16 |
20224 | layla | female | 18 | 15 | 14 | 13 |
to create a table Use the create command . You must specify the table name and the ColumnFamily(s) name. Use the next syntax to do just that:
hbase(main):004:0> create 'namespace:table', 'column_family'
if you don’t specify namespace it will use the default namespace. For our demo, lets create a students table in with 2 column families info and grade :
hbase(main):005:0> create 'students' , 'info', 'grade'
0 row(s) in 1.2520 seconds
=> Hbase::Table - students
you can use 'list' command to confirm that our students table is created successfully:
hbase(main):006:0> list
TABLE
students
1 row(s) in 0.0150 seconds
=> ["students"]
Now use the describe (you can use shortened version of it: desc ) command to see table details, including default configurations values :
hbase(main):007:0> describe 'students'
Table students is ENABLED
students
COLUMN FAMILIES DESCRIPTION
{NAME => 'grade', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.1510 seconds
now that we created our table, it is time to put some data to work with, you can use many approaches especially in bulk and continues insert, but for our case we will simply use shell as our data is relatively small.
To put data into your table, use the put command with this syntax:
hbase(main):000:0> put 'table_name', 'row_id', 'column_family:column', 'value'
the syntax seems confusing , and to make it clear, let us in the next command insert three values, one at a time, all corresponds to one row.
The first insert is 20221, column info:name, with a value of ahmed . Columns in HBase are comprised of a column family prefix, info in our example, followed by a colon and then a column qualifier suffix, name in our case:
hbase(main):026:0> put 'students','20221', 'info:name','ahmed'
0 row(s) in 0.2950 seconds
hbase(main):027:0> put 'students','20221', 'info:gender','male'
0 row(s) in 0.0120 seconds
hbase(main):028:0> put 'students','20221', 'info:age','22'
0 row(s) in 0.0160 seconds
in the same way, you can add more cell values for grade column family to the same student 20221:
hbase(main):029:0> put 'students','20221', 'grade:math','17'
0 row(s) in 0.0280 seconds
hbase(main):030:0> put 'students','20221', 'grade:physics','18'
0 row(s) in 0.0170 seconds
hbase(main):031:0> put 'students','20221', 'grade:science','16'
0 row(s) in 0.0220 seconds
if you want to select the data we just insert, we can use 'scan' command:
hbase(main):033:0> scan 'students' # to show table data
ROW COLUMN+CELL
20221 column=grade:math, timestamp=1662386270882, value=17
20221 column=grade:physics, timestamp=1662386270967, value=18
20221 column=grade:science, timestamp=1662386271032, value=16
20221 column=info:age, timestamp=1662385287949, value=22
20221 column=info:gender, timestamp=1662385287894, value=male
20221 column=info:name, timestamp=1662385287819, value=ahmed
1 row(s) in 0.0710 seconds
the data returned shows the values correspond to each column field for the row 20221 with timestamp as well.
Now lets add more rows to get some meaningful queries. Here we will add 3 more students with 2 column families:
put 'students','20222', 'info:name','sara'
put 'students','20222', 'info:gender','female'
put 'students','20222', 'info:age','20'
put 'students','20223', 'info:name','omar'
put 'students','20223', 'info:gender','male'
put 'students','20223', 'info:age','25'
put 'students','20224', 'info:name','layla'
put 'students','20224', 'info:gender','female'
put 'students','20224', 'info:age','18'
put 'students','20222', 'grade:math','14'
put 'students','20222', 'grade:physics','16'
put 'students','20222', 'grade:science','15'
put 'students','20223', 'grade:math','18'
put 'students','20223', 'grade:physics','17'
put 'students','20223', 'grade:science','16'
put 'students','20224', 'grade:math','15'
put 'students','20224', 'grade:physics','14'
put 'students','20224', 'grade:science','13'
you can count the number of records in the students table to verify:
hbase(main):056:0> count 'students'
4 row(s) in 0.0320 seconds
=> 4
use scan command again to see how data look like, but be careful and don’t use it against large tables as it can cause performance issue ,because all the data will get fetched.
Another way query table data, is to scan only the data in the specific column family, lets for a moment select info column family, we can use next syntax to achieve that:
hbase(main):058:0> scan 'students',{COLUMNS=>'info'}
ROW COLUMN+CELL
20221 column=info:age, timestamp=1662385287949, value=22
20221 column=info:gender, timestamp=1662385287894, value=male
20221 column=info:name, timestamp=1662385287819, value=ahmed
20222 column=info:age, timestamp=1662387271716, value=20
20222 column=info:gender, timestamp=1662387271626, value=female
20222 column=info:name, timestamp=1662387271561, value=sara
20223 column=info:age, timestamp=1662387271976, value=25
20223 column=info:gender, timestamp=1662387271888, value=male
20223 column=info:name, timestamp=1662387271840, value=omar
20224 column=info:age, timestamp=1662387272219, value=18
20224 column=info:gender, timestamp=1662387272084, value=female
20224 column=info:name, timestamp=1662387272023, value=layla
4 row(s) in 0.0860 seconds
we can also query only the math mark in the grade column family:
hbase(main):060:0> scan 'students',{COLUMNS=>'grade:math'}
ROW COLUMN+CELL
20221 column=grade:math, timestamp=1662387264427, value=17
20222 column=grade:math, timestamp=1662387264583, value=14
20223 column=grade:math, timestamp=1662387264795, value=18
20224 column=grade:math, timestamp=1662387264965, value=15
4 row(s) in 0.0580 seconds
In Get mode, data is queried based on the row key. For example if we want to get related information for student '20224' :
hbase(main):061:0> get 'students','20224'
COLUMN CELL
grade:math timestamp=1662387264965, value=15
grade:physics timestamp=1662387265005, value=14
grade:science timestamp=1662387265054, value=13
info:age timestamp=1662387272219, value=18
info:gender timestamp=1662387272084, value=female
info:name timestamp=1662387272023, value=layla
6 row(s) in 0.0580 seconds
we can also fetch for specific column for student '20224':
hbase(main):062:0> get 'students','20224','info:name'
COLUMN CELL
info:name timestamp=1662387272023, value=layla
1 row(s) in 0.0200 seconds
we can also query data by specifying some criteria. Lets get only student names , but limit the result to only 2 rows:
hbase(main):064:0> scan 'students',{STARTROW=>'20221', 'LIMIT'=>2, COLUMNS=>'info:name'}
ROW COLUMN+CELL
20221 column=info:name, timestamp=1662385287819, value=ahmed
20222 column=info:name, timestamp=1662387271561, value=sara
2 row(s) in 0.0360 seconds
HBase can store multiples versions for the same row. In our case we will put (update) the student '20221' name, to have multiples versions:
hbase(main):065:0> put 'students','20221','info:name','Ahmed Ali'
0 row(s) in 0.0160 seconds
hbase(main):066:0> put 'students','20221','info:name','Ahmed Sghir'
0 row(s) in 0.0130 seconds
hbase(main):067:0> put 'students','20221','info:name','Ahmed Wald Nabil'
0 row(s) in 0.0150 seconds
if you query the student '20221' name:
hbase(main):069:0> get 'students','20221','info:name'
COLUMN CELL
info:name timestamp=1662390375880, value=Ahmed Wald Nabil
1 row(s) in 0.0110 seconds
as you can see, you only get the latest update, the same behavior as in relational databases. however, this is not what we want to !? we want to get all the previous updates (called versions in HBase) as well and this is what HBase is god on it.
You can set VERSIONS to specify the number of versions (updates to be queried:
hbase(main):070:0> get 'students','20221',{'COLUMNS'=>'info', VERSIONS=>5}
COLUMN CELL
info:age timestamp=1662385287949, value=22
info:gender timestamp=1662385287894, value=male
info:name timestamp=1662390375880, value=Ahmed Wald Nabil
3 row(s) in 0.0190 seconds
The version is specified during the search, but the last record is still displayed. Although VERSIONS is added, only one record is returned after the get operation. This is because the default value of VERSIONS is 1 during table creation. To verify that execute the next command and check VERSIONS property:
hbase(main):071:0> desc 'students'
Table students is ENABLED
students
COLUMN FAMILIES DESCRIPTION
{NAME => 'grade', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
2 row(s) in 0.0810 seconds
To view data of multiple versions, run the following statement to change the value of VERSIONS of the table ( or you can specify the value when creating the table as well):
hbase(main):072:0> alter 'students', {NAME=>'info', 'VERSIONS'=>5}
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 3.1770 seconds
Then, insert multiple data record updates to student '20221' names:
hbase(main):073:0> put 'students','20221','info:name','Ahmed Ali'
0 row(s) in 0.0230 seconds
hbase(main):074:0> put 'students','20221','info:name','Ahmed Ben Ali'
0 row(s) in 0.0180 seconds
hbase(main):075:0> put 'students','20221','info:name','Ahmed Wald Ali'
0 row(s) in 0.0240 seconds
now to get the last 5 names versions for student '20221', we just run get command as in the next command:
hbase(main):076:0> get 'students','20221',{'COLUMNS'=>'info', VERSIONS=>5}
COLUMN CELL
info:age timestamp=1662385287949, value=22
info:gender timestamp=1662385287894, value=male
info:name timestamp=1662391886964, value=Ahmed Wald Ali
info:name timestamp=1662391885641, value=Ahmed Ben Ali
info:name timestamp=1662391882289, value=Ahmed Ali
info:name timestamp=1662390375880, value=Ahmed Wald Nabil
info:name timestamp=1662385287819, value=ahmed
7 row(s) in 0.0350 seconds
if you want, you can delete specific single row field, for example lets check the age of student '20221':
hbase(main):078:0> get 'students','20221', 'info:age'
COLUMN CELL
info:age timestamp=1662385287949, value=22
1 row(s) in 0.0460 seconds
let’s delete this age value, that correspond to 'Ahmed' student:
hbase(main):079:0> delete 'students','20221','info:age'
0 row(s) in 0.0590 seconds
if we check again, the age value is gone, and it is no longer in the database:
hbase(main):080:0> get 'students','20221', 'info:age'
COLUMN CELL
0 row(s) in 0.0110 seconds
hbase(main):081:0> get 'students','20221'
COLUMN CELL
grade:math timestamp=1662387264427, value=17
grade:physics timestamp=1662387264488, value=18
grade:science timestamp=1662387264533, value=16
info:gender timestamp=1662385287894, value=male
info:name timestamp=1662391886964, value=Ahmed Wald Ali
5 row(s) in 0.0310 seconds
we can also delete all records of a specific row as well, using 'deleteall' command:
hbase(main):082:0> deleteall 'students','20221'
0 row(s) in 0.0150 seconds
If you want to delete a table or change its settings, as well as in some other situations, you need to disable the table first, using the disable command :
hbase(main):083:0> disable 'students'
0 row(s) in 2.4950 seconds
N.B: you can use 'enable' command in the same way if just change some settings.
To drop (delete) a table, use the drop command:
hbase(main):084:0> drop 'students'
0 row(s) in 1.2880 seconds
in the same way, you can drop a namesapace if you are no longer using it:
This command drop_namespace is used to drop the namespace . You can drop only empty namespace, so you must drop all the tables created in that namespace before attempting to drop it.
as our namespace 'class1' is empty, we can run the next command to do so:
hbase(main):085:0> drop_namespace 'class1'
0 row(s) in 1.3680 seconds
by the end of this demo, you now can use different commands to manipulate data in HBae, including create, put, get, scan, delete drop and many others.