Ayadi Tahar | Columnar Databases: HBase Data Overview

Publish Date: 2022-09-13


Hadoop was initially built for batch jobs in big data; it was not created (at least initially) for random reads and writes over large datasets. That is because operations that go through the MapReduce I/O framework classes read and write sequentially. Hadoop lacks features for updating single records in large datasets because it was intended for write-once, read-many workloads.

So back then, large clusters built for batch jobs could not serve applications that need fast random-access reads and writes, and that is the limitation HBase came to solve.

What is HBase

HBase is a distributed wide-column database that runs primarily on a Hadoop cluster, offering low-latency random access and heavy read/write operations over large datasets. It is often called the Hadoop database.

HBase is modeled after Google's paper "Bigtable: A Distributed Storage System for Structured Data". Bigtable is the database that powers many Google services such as Search, YouTube, Analytics, Maps, and Gmail. HBase is an implementation of the design described in that paper.

HBase is linearly scalable and provides fast key-based lookups and range scans over millions and even billions of records. Because it stores its data on HDFS, HBase inherits HDFS features such as data reliability, redundancy, scalability, and fault tolerance.

HBase does not support SQL natively (Apache Phoenix adds a SQL layer if you want one), but it offers Java, REST, and Thrift APIs for client access. HBase can be integrated with other tools in the big data ecosystem for operational or analytical workloads.

If you already run a Hadoop cluster and you need a fast database that can deliver consistent read/write performance over large record sets, that is a good reason to use HBase.

Terminology

Because the components of columnar databases differ a little from those of relational databases, let's go through them to get familiar with the terms.

Namespace

A namespace is akin to a database or schema in relational databases. It provides a logical grouping of tables for an application.

With a namespace, we can perform general administrative tasks such as security and user management for a group of tables, instead of for each table individually.
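To make the grouping idea concrete, here is a minimal Python sketch of a toy model, not the HBase API. It assumes the `namespace:table` naming form, with unqualified table names falling into a namespace called `default`. The `grants` dictionary and the `can_access` helper are hypothetical names invented for this illustration.

```python
# Toy model only: namespaces as a way to grant access to a group of tables.
# `grants` and `can_access` are hypothetical, not part of any HBase API.

def split_table_name(full_name: str) -> tuple[str, str]:
    """Split 'namespace:table' into its parts; bare names use 'default'."""
    namespace, sep, table = full_name.partition(":")
    if not sep:
        return "default", full_name
    return namespace, table

# One grant per namespace covers every table inside it.
grants = {"school": {"alice"}, "default": {"bob"}}

def can_access(user: str, full_name: str) -> bool:
    namespace, _ = split_table_name(full_name)
    return user in grants.get(namespace, set())

print(can_access("alice", "school:students"))  # True
print(can_access("alice", "school:teachers"))  # True
print(can_access("alice", "logs"))             # False
```

A single entry in `grants` covers both `school:students` and `school:teachers`, which is the point of managing permissions at the namespace level.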

Table

A table is the storage structure in HBase. When defining a table, you are required to provide at least a name and a column family, via Data Definition Language (DDL) or the Java API. You can alter the table later on and add, remove, or change column families and table properties. Specifying column names is not required, because columns are not part of the table schema in HBase.
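The rules above can be sketched with a small in-memory model. This is a toy written for this article, not the HBase client API: the `ToyTable` class and its methods are hypothetical. It only shows that a table needs a name and at least one column family, that families can be added later, and that column names are supplied at write time rather than declared.

```python
# Toy model only: a table is defined by a name plus at least one column family.
# Column names are never declared; they appear when data is written.

class ToyTable:
    def __init__(self, name: str, families: list[str]):
        if not families:
            raise ValueError("a table needs at least one column family")
        self.name = name
        self.families = set(families)
        # row key -> column family -> column name -> value
        self.rows: dict[str, dict[str, dict[str, bytes]]] = {}

    def add_family(self, family: str) -> None:
        """Alter the table after creation by adding a column family."""
        self.families.add(family)

    def put(self, row_key: str, family: str, column: str, value: bytes) -> None:
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[column] = value


students = ToyTable("students", ["info"])
students.put("20221", "info", "name", b"ahmed")  # 'name' was never declared
students.add_family("grade")                     # alter the table later on
students.put("20221", "grade", "math", b"17")
print(students.rows["20221"])
```

Writing to a family that was never declared raises an error in this sketch, while writing a brand-new column inside a declared family just works.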

Row

Tables are a collection of rows, and you can think of HBase rows as key-value pairs. The key is called a row key (comparable to a primary key or partition key) and the value is one or more column families. Row keys are lexicographically sorted in a table, which is why scan and get operations can be fast (more on that shortly).

In the world of HBase, the terms row and row key are used interchangeably.
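Why sorted row keys help can be shown with a few lines of Python. This is a sketch over an in-memory list, assuming nothing about how HBase implements lookups internally. The `get` and `scan` helpers are hypothetical names, and the half-open `[start, stop)` range is a convention chosen for this example.

```python
import bisect

# Toy model only: row keys kept in lexicographic (byte) order.
row_keys = sorted([b"20223", b"20221", b"20224", b"20222", b"30001"])

def get(key: bytes) -> bool:
    """Point lookup by binary search instead of a full pass over the table."""
    i = bisect.bisect_left(row_keys, key)
    return i < len(row_keys) and row_keys[i] == key

def scan(start: bytes, stop: bytes) -> list[bytes]:
    """Range scan over [start, stop): sorted keys make this one contiguous slice."""
    lo = bisect.bisect_left(row_keys, start)
    hi = bisect.bisect_left(row_keys, stop)
    return row_keys[lo:hi]

print(get(b"20222"))             # True
print(scan(b"20222", b"20224"))  # [b'20222', b'20223']

# Sorting is by bytes, not by numeric value: b"100" sorts before b"20".
print(sorted([b"20", b"100"]))   # [b'100', b'20']
```

The last line is worth keeping in mind when designing row keys: numeric identifiers of different lengths will not sort in numeric order unless they are padded to a fixed width.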

Column family

The value of a row can be one or more column families. A column family is a collection of columns or cells. Column families are defined up front when the table is created, and they can be added or changed afterwards. Even within a single table, different column families can have different configurations. The general rule is to create one column family for each group of related fields in a domain.

Cell

As we know, a column family is a collection of columns or cells. A cell is the smallest storage unit of a table, and it is addressed by a combination of a row key, a column family, a column name (also called the column qualifier), and a timestamp.

The value of a cell is a byte array. In fact, the names of column families, the row keys, and the data values themselves are all stored as byte arrays, because HBase does not support data types. Converting data between the HBase database and the application space is therefore the responsibility of the developer who writes the application.
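What that responsibility looks like in practice can be sketched in Python. The helper names below are hypothetical, and the encodings (UTF-8 for text, a 4-byte big-endian signed integer for numbers) are assumptions made for this example, not something HBase prescribes.

```python
import struct

# Toy model only: the store sees bytes; the application decides what they mean.

def encode_str(text: str) -> bytes:
    return text.encode("utf-8")

def decode_str(raw: bytes) -> str:
    return raw.decode("utf-8")

def encode_int(number: int) -> bytes:
    # 4-byte big-endian signed integer (an assumed convention for this sketch).
    return struct.pack(">i", number)

def decode_int(raw: bytes) -> int:
    return struct.unpack(">i", raw)[0]

cell_name = encode_str("ahmed")
cell_age = encode_int(22)

print(cell_age)               # b'\x00\x00\x00\x16'
print(decode_str(cell_name))  # ahmed
print(decode_int(cell_age))   # 22
```

The reader has to apply the same convention as the writer. Nothing in the stored bytes says whether `b'\x00\x00\x00\x16'` is an integer, a string, or something else.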

Timestamp and Versions

The purpose of the timestamp in a cell is to enable versioning of that cell within a row. If we make multiple updates to the same cell, each update is stored as a separate version, and a read returns only the value with the latest timestamp. However, you can query for previous versions of the data in that cell as well.
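The versioning behaviour can be sketched as a dictionary of timestamped values per cell. This is a toy model, not the HBase API: the `VersionedStore` class and its methods are hypothetical, and the integer timestamps are picked by hand for readability.

```python
from typing import Optional

# Toy model only: each cell keeps several timestamped versions of its value.
# A cell is located by (row key, column family, column name); the timestamp
# then selects one version of it.

class VersionedStore:
    def __init__(self):
        self.cells: dict[tuple[str, str, str], dict[int, bytes]] = {}

    def put(self, row: str, family: str, column: str, ts: int, value: bytes) -> None:
        self.cells.setdefault((row, family, column), {})[ts] = value

    def get(self, row: str, family: str, column: str,
            ts: Optional[int] = None) -> bytes:
        versions = self.cells[(row, family, column)]
        if ts is None:
            ts = max(versions)  # default read: the latest timestamp wins
        return versions[ts]

    def history(self, row: str, family: str, column: str) -> list[tuple[int, bytes]]:
        """All versions of a cell, newest first."""
        return sorted(self.cells[(row, family, column)].items(), reverse=True)


store = VersionedStore()
store.put("20221", "grade", "math", ts=1, value=b"15")
store.put("20221", "grade", "math", ts=2, value=b"17")

print(store.get("20221", "grade", "math"))        # b'17'
print(store.get("20221", "grade", "math", ts=1))  # b'15'
print(store.history("20221", "grade", "math"))    # [(2, b'17'), (1, b'15')]
```

The second write does not overwrite the first; it adds a newer version, and a plain read simply picks the one with the highest timestamp.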

Example:

Assume you have student records with a 1-to-1 relationship between each student's academic summary and bio information. In that case you can have two different column families in a single student table. Even though the read/write access patterns for these two column families differ, both reside in a single table.

The table below shows where the different HBase components (row key, column family, column, and cell) appear in this example:

row key | info                  | grade
        | name   gender  age    | math  physics  science
20221   | ahmed  male    22     | 17    15       16
20222   | sara   female  20     | 14    16       15
20223   | omar   male    25     | 18    17       12
20224   | layla  female  18     | 15    14       13
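The same example can be written as a nested map, which is a common way to picture the HBase data model. The sketch below is plain Python, not the HBase API. It covers only the first two students, and the `read_family` helper is a hypothetical name used to show that each column family can be read on its own.

```python
# Toy model only: the example table as
# row key -> column family -> column name -> value.
# Values are bytes because HBase stores everything as byte arrays.
students = {
    b"20221": {
        b"info":  {b"name": b"ahmed", b"gender": b"male",   b"age": b"22"},
        b"grade": {b"math": b"17", b"physics": b"15", b"science": b"16"},
    },
    b"20222": {
        b"info":  {b"name": b"sara",  b"gender": b"female", b"age": b"20"},
        b"grade": {b"math": b"14", b"physics": b"16", b"science": b"15"},
    },
}

def read_family(row_key: bytes, family: bytes) -> dict[bytes, bytes]:
    """Read one column family of one row without touching the other family."""
    return students[row_key][family]

# A grades report only needs the 'grade' family.
print(read_family(b"20221", b"grade"))
# A profile page only needs the 'info' family.
print(read_family(b"20222", b"info")[b"name"])  # b'sara'
```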

Final thoughts

HBase is a column-oriented database, and its tables are sorted by row key.

The table schema defines only column families; the columns inside each family are stored as key-value pairs.

A table can have multiple column families, and each column family can have any number of columns.

Resources: