Here is another interesting use case that came up while I was working with one of our clients in the insurance industry. The client had an enormous amount of claim data residing in multiple SQL Server databases, which needed to be consolidated into one. Some of the queries on this data took days to run, so we were looking for an alternative solution that could process the data in a distributed fashion and save us some time. Since the company was already using Hadoop, we started looking into a Hadoop-based solution.
We had a few options on the table, such as Hive, Pig, and HBase, and after some brainstorming decided to go with HBase for the following reasons:
It is an open-source distributed database that would deliver higher performance while remaining cost-effective.
We do not have to worry about distributing the data for faster processing, since Hadoop takes care of that.
It handles large batch workloads without the overhead of maintaining secondary indexes; rows are simply ordered by row key.
Data integrity: HBase acknowledges a write only after its write-ahead log entry has reached all three HDFS replicas (in memory, via hflush).
It is easily scalable, fault-tolerant, and highly available.
The next step was to move the data from SQL Server to HDFS, for which we used Sqoop. Sqoop imports all the data and stores it as comma-separated text files by default.
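A minimal Sqoop invocation might look like the following; the connection string, credentials, table name, and target directory below are illustrative placeholders, not the client's actual values:

```shell
# Import a SQL Server table into HDFS as comma-separated text files.
# All names and connection details here are placeholders for illustration.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=claims" \
  --username myuser -P \
  --table CLAIMS \
  --target-dir /user/hadoop/claims \
  --fields-terminated-by ','
```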
Once the data was in HDFS, we created an HBase table and loaded the CSV data into it, in one of the following ways:
hadoop jar <Path To HBase Jar> importtsv -Dimporttsv.columns=<Column Names> '-Dimporttsv.separator=,' <Table To Import Into> <Input Directory>
Or, if importtsv was run with -Dimporttsv.bulk.output to generate HFiles instead of writing directly, use the complete bulk loader:
hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] <Input Directory> <Table To Import Into>
We gained everything we had hoped for by moving to HBase, but something was still missing. Querying an HBase database was not everyone's cup of tea, so the process needed to be simplified. Since the insurance folks were already familiar with SQL, the easiest way out was to build a Hive schema on top of the HBase table.
There are two common cases when creating a Hive table on top of HBase:
We do not know the column names, or we need all the columns; in this case we can explode all the data into a map of key-value pairs.
We need only specific columns, in which case we specify the mapping for each column individually.
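For the first case, a sketch of the DDL could look like the following; the table names and the column family TEST are assumptions for illustration, and mapping the family prefix `TEST:` with no qualifier pulls every column in that family into the map:

```sql
-- Expose all columns of the TEST family as key-value pairs in a Hive map.
-- Table and family names here are illustrative assumptions.
CREATE EXTERNAL TABLE test_all(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,TEST:")
TBLPROPERTIES("hbase.table.name" = "MY_TABLE");
```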
To map specific columns by name:
CREATE EXTERNAL TABLE test_map(id string, colname1 string, colname2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,TEST:mydata1,TEST:mydata5")
TBLPROPERTIES("hbase.table.name" = "MY_TABLE");
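With the Hive schema in place, the insurance folks could query the HBase data with plain SQL. A hypothetical query against the table above (the filter value is made up for illustration) might be:

```sql
-- Query the HBase-backed Hive table just like any SQL table.
SELECT id, colname1
FROM test_map
WHERE colname2 = 'APPROVED';
```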