20 essential Hadoop tools for crunching Big Data

16th Jul ’14, 05:34 PM in Hadoop

Partha Sarathi Contributor

1. Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. At Yahoo, HDFS has been used to manage 40 petabytes of enterprise data.


a. Rack awareness allows consideration of a node’s physical location when allocating storage and scheduling tasks.
b. Minimal data motion. MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces network I/O, keeps most of the I/O on the local disk or within the same rack, and provides very high aggregate read/write bandwidth.
c. Utilities diagnose the health of the file system and can rebalance the data on different nodes.
d. Rollback allows system operators to bring back the previous version of HDFS after an upgrade, in case of human or system errors.
e. Standby NameNode provides redundancy and supports high availability.
f. Highly operable. Hadoop handles different types of cluster problems that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of 1000s of nodes.
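
To make the rack awareness idea concrete, here is a minimal Python sketch of rack-aware replica placement. It is loosely modeled on the commonly described HDFS default of three replicas (one on the writer's node, two on a different rack); it is not the actual HDFS code. The function name, the topology dictionary, and the node and rack names are all hypothetical, and the sketch assumes at least two racks with at least two nodes each.

```python
import random


def place_replicas(writer, topology, rng=None):
    """Pick three nodes for one block's replicas (illustrative only).

    topology maps a rack name to the list of node names on that rack.
    Assumes at least two racks and at least two nodes per rack.
    """
    rng = rng or random.Random(0)
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}

    # Replica 1: the writer's own node, so the first copy needs no network hop.
    first = writer

    # Replica 2: a node on a different rack, to survive a whole-rack failure.
    remote_nodes = [n for n in rack_of if rack_of[n] != rack_of[writer]]
    second = rng.choice(remote_nodes)

    # Replica 3: a different node on the same rack as replica 2,
    # which keeps the third copy's transfer inside one rack.
    neighbours = [n for n in topology[rack_of[second]] if n != second]
    third = rng.choice(neighbours)

    return [first, second, third]


# Hypothetical two-rack cluster.
topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", topology)
```

With this layout, losing all of rack1 or all of rack2 still leaves at least one copy of the block readable.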

2. HBase

HBase is a column-oriented database management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. Unlike relational database systems, HBase does not support a structured query language like SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in Java, much like a typical MapReduce application, but HBase also supports writing applications through Avro, REST, and Thrift.
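
To see why a column-oriented layout suits sparse data, here is a rough in-memory Python sketch. It is only a toy model of the idea, not the HBase API: the class, its methods, and the `family:qualifier` column names are all made up for illustration. The point is that a row stores only the cells it actually has, so missing columns cost nothing.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse table: each row keeps only the columns that were written."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Writing a cell creates the row on demand.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Reading a missing cell returns the default and stores nothing.
        return self._rows.get(row_key, {}).get(column, default)

    def stored_cells(self):
        return sum(len(columns) for columns in self._rows.values())


table = SparseTable()
table.put("user1", "info:name", "Ann")
table.put("user1", "info:email", "ann@example.com")
table.put("user2", "info:name", "Bob")  # user2 has no email cell at all
```

Here three cells are stored in total, even though the table conceptually has two rows and two columns.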


a. Linear and modular scalability.
b. Strictly consistent reads and writes.
c. Automatic and configurable sharding of tables.
d. Automatic failover support between RegionServers.
e. Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
f. Easy to use Java API for client access.
g. Block cache and Bloom Filters for real-time queries.
h. Query predicate push down via server-side Filters.
i. Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.

3. Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.


a. Indexing to provide acceleration. Index types include compaction and bitmap indexes as of 0.10, and more index types are planned.
b. Different storage types such as plain text, RCFile, HBase, ORC, and others.
c. Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
d. Operating on compressed data stored in the Hadoop ecosystem, with algorithms including gzip, bzip2, and snappy.
e. Built-in user defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.
f. SQL-like queries (HiveQL), which are implicitly converted into map-reduce jobs.

4. Sqoop

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.


a. Connecting to a database server
b. Controlling parallelism
c. Controlling the import process
d. Importing data into Hive
e. Importing data into HBase
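
"Controlling parallelism" is easiest to picture as splitting a table's key range among several parallel import tasks, which is what Sqoop's `--split-by` and `--num-mappers` options are for. The Python sketch below only illustrates that idea. The function name is hypothetical, it assumes an integer key column with `lo <= hi`, and Sqoop's real splitting logic may differ.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous slices.

    Returns at most num_mappers (start, end) pairs, each inclusive,
    with sizes that differ by at most one key.
    """
    total = hi - lo + 1
    slices = min(num_mappers, total)  # never create empty slices
    base, extra = divmod(total, slices)

    ranges = []
    start = lo
    for i in range(slices):
        size = base + (1 if i < extra else 0)  # first `extra` slices get one more key
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Keys 1..10 shared among 4 parallel tasks.
print(split_ranges(1, 10, 4))  # [(1, 3), (4, 6), (7, 8), (9, 10)]
```

Each task would then import only the rows whose key falls inside its own slice, so no row is read twice.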

5. Pig

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At the present time, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig’s language layer currently consists of a textual language called Pig Latin.


a. Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
b. Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
c. Extensibility. Users can create their own functions to do special-purpose processing.

6. ZooKeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.


a. Fast. ZooKeeper is especially fast with workloads where reads to the data are more common than writes. The ideal read/write ratio is about 10:1.
b. Reliable. ZooKeeper is replicated over a set of hosts (called an ensemble) and the servers are aware of each other. As long as a critical mass of servers is available, the ZooKeeper service will also be available. There is no single point of failure.
c. Simple. ZooKeeper maintains a standard hierarchical name space, similar to files and directories.
d. Ordered. The service maintains a record of all transactions, which can be used for higher-level abstractions, like synchronization primitives.
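
The "critical mass" in the reliability point can be read as a simple majority of the ensemble. The Python sketch below shows that arithmetic under this assumption. The function names are illustrative, and this is not ZooKeeper code.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def is_available(ensemble_size, servers_up):
    """The service stays available while a majority of servers is up."""
    return servers_up >= quorum_size(ensemble_size)


def tolerated_failures(ensemble_size):
    """How many servers can fail before the majority is lost."""
    return ensemble_size - quorum_size(ensemble_size)


for size in (3, 5, 6):
    print(size, quorum_size(size), tolerated_failures(size))
```

A 5-server ensemble needs 3 servers up and tolerates 2 failures. A 6-server ensemble needs 4 and still tolerates only 2, which is why odd ensemble sizes are the usual choice.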


7. NoSQL

NoSQL refers to next-generation databases that mostly address some of these points: being non-relational, distributed, open-source, and horizontally scalable. The original intention was modern web-scale databases.


a. Simple data model using key-value pairs with secondary indexes
b. Simple programming model with ACID transactions, tabular data models, and JSON support
c. Application security with authentication and session-level SSL encryption
d. Integrated with Oracle Database, Oracle Wallet, and Hadoop
e. Geo-distributed data with support for multiple data centers
f. High availability with local and remote failover and synchronization
g. Scalable throughput and bounded latency

8. Mahout

Apache Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes.


a. Collaborative filtering: mines user behavior and makes product recommendations (e.g. Amazon recommendations).
b. Clustering: takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
c. Classification: learns from existing categorizations and then assigns unclassified items to the best category.
d. Frequent itemset mining: analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.

9. Lucene/Solr

There is but one tool for indexing large blocks of unstructured text, and it’s a natural partner for Hadoop. Written in Java, Lucene integrates easily with Hadoop, creating one big tool for distributed text management. Lucene handles the indexing; Hadoop distributes queries across the cluster.


a. Advanced Full-Text Search Capabilities
b. Optimized for High Volume Web Traffic
c. Standards Based Open Interfaces – XML, JSON and HTTP
d. Comprehensive HTML Administration Interfaces
e. Server statistics exposed over JMX for monitoring
f. Linearly scalable, auto index replication, auto failover and recovery
g. Near Real-time indexing
h. Flexible and Adaptable with XML configuration
i. Extensible Plugin Architecture

10. Avro

Avro provides a convenient way to represent complex data structures within a Hadoop MapReduce job. Avro data can be used as both input to and output from a MapReduce job, as well as the intermediate format. A single job can use Avro data for all three, but it’s possible to mix and match; for instance, MapReduce can be used to aggregate a particular field in an Avro record.

11. Oozie

Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. It can also be used to schedule jobs specific to a system, like Java programs or shell scripts.

There are two basic types of Oozie jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs), specifying a sequence of actions to execute; each action has to wait for the actions before it to finish. Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability. In addition, Oozie Bundle provides a way to package multiple coordinator and workflow jobs and to manage the lifecycle of those jobs.


a. Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
b. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
c. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
d. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as system specific jobs (such as Java programs and shell scripts).
e. Oozie is a scalable, reliable and extensible system.
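
The "DAG of actions" idea can be sketched in a few lines of Python using the standard library's `graphlib` module (Python 3.9+). This is only an illustration of running actions in dependency order; real Oozie workflows are defined differently, and the action names and the `run_workflow` helper below are invented for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it waits for.
workflow = {
    "import_data": set(),
    "clean_data": {"import_data"},
    "build_report": {"clean_data"},
    "update_index": {"clean_data"},
    "notify": {"build_report", "update_index"},
}


def run_workflow(dag, run_action):
    """Run every action once, only after all of its dependencies are done."""
    order = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        # Actions returned together are independent and could run in parallel.
        for action in sorted(sorter.get_ready()):
            run_action(action)
            order.append(action)
            sorter.done(action)
    return order


executed = run_workflow(workflow, run_action=lambda name: None)
print(executed)
```

In this example `build_report` and `update_index` become ready at the same time, and `notify` runs only after both have finished.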

12. GIS tools

The world is a big place and working with geographic maps is a big job for clusters running Hadoop. The GIS (Geographic Information Systems) tools for Hadoop project has adapted some of the best Java-based tools for understanding geographic information to run with Hadoop. Your databases can handle geographic queries using coordinates instead of strings. Your code can deploy the GIS tools to calculate in three dimensions. The trickiest part is figuring out when the word “map” refers to a flat thing that represents the world and when “map” refers to the first step in a Hadoop job.


a. Run filter and aggregate operations on billions of spatial data records inside Hadoop based on spatial criteria.
b. Define new areas represented as polygons, and run Point in Polygon analysis on billions of spatial data records inside Hadoop.
c. Visualize analysis results on a map with rich styling capabilities, and a rich set of base maps.
d. Integrate your maps in reports, or publish them as map applications online.
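
For readers who have not met "Point in Polygon" analysis, here is a small Python sketch of the classic ray-casting test on flat x/y coordinates. It is a generic textbook technique, not code from the GIS tools for Hadoop, and the function names and sample coordinates are made up. Points lying exactly on an edge are not treated specially.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses.

    polygon is a list of (x, y) vertices in order; an odd number of
    crossings means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_inside(points, polygon):
    """Filter-and-aggregate in miniature: how many points fall in the area."""
    return sum(1 for px, py in points if point_in_polygon(px, py, polygon))


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
points = [(2, 2), (5, 2), (1, 3), (-1, -1)]
print(count_points_inside(points, square))  # 2
```

On a cluster, the same test would run inside the map step for each record, with the counting done in the reduce step.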

13. Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.


a. A new in-memory channel that can spill to disk.
b. A new dataset sink that uses the Kite API to write data to HDFS and HBase.
c. Support for the Elasticsearch HTTP API in the Elasticsearch sink.
d. Much faster replay in the File Channel.

14. Clouds

Many of the cloud platforms are scrambling to attract Hadoop jobs because they can be a natural fit for the flexible business model that rents machines by the minute. Companies can spin up thousands of machines to crunch on a big data set in a short amount of time instead of buying permanent racks of machines that can take days or even weeks to do the same calculation. Some companies, such as Amazon, are adding an additional layer of abstraction by accepting just the JAR file filled with software routines. Everything else is set up and scheduled by the cloud.


a. Data storage services to capture, analyze and access data in any format
b. Data management services to process, monitor and operate Hadoop
c. Data platform services to secure, archive and scale for consistent availability

15. Spark

Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster’s memory and query it repeatedly, making it well suited to machine learning algorithms.


a. Proven scalability to 100 nodes in the research lab and 80 nodes in production at Yahoo
b. Ability to cache datasets in memory for interactive data analysis: extract a working set, cache it, query it repeatedly.
c. Interactive command line interface (in Scala or Python) for low-latency data exploration at scale.
d. Higher level library for stream processing, through Spark Streaming.
e. Higher level libraries for machine learning and graph processing that, because of the distributed memory-based Spark architecture, are ten times as fast as Hadoop disk-based Apache Mahout and even scale better than Vowpal Wabbit.

16. Ambari

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.


a. Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
b. Ambari leverages Ganglia for metrics collection.
c. Ambari leverages Nagios for system alerting and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).

17. MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs’ component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.


a. Scale-Out Architecture – Add servers to increase capacity
b. High Availability – Serve mission-critical workflows and applications
c. Fault Tolerance – Automatically and seamlessly recover from failures
d. Flexible Access – Multiple and open frameworks for serialization and file system mounts
e. Load Balancing – Place data intelligently for maximum efficiency and utilization
f. Tunable Replication – Multiple copies of each file provide data protection and computational performance
g. Security – POSIX-based file permissions for users and groups with optional LDAP integration
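
The programming model itself is small enough to sketch in plain Python. The word count below walks through the map, shuffle, and reduce steps in a single process. It only illustrates the paradigm and does not use the Hadoop API; all function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big clusters"]))
# {'big': 2, 'data': 1, 'clusters': 1}
```

On a real cluster the map calls run on many nodes at once, the shuffle moves data across the network, and a failed task is simply run again on its input split.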

18. SQL on Hadoop

If you want to run a quick, ad-hoc query of all of that data sitting on your huge cluster, you could write a new Hadoop job, which would take a bit of time. After programmers started doing this too often, they started pining for the old SQL databases, which could answer questions when posed in that relatively simple language of SQL. They scratched that itch, and now there are a number of tools emerging from various companies. All offer a faster path to answers.

19. Impala

Cloudera Impala is the industry’s leading massively parallel processing (MPP) SQL query engine that runs natively in Apache Hadoop. The Apache-licensed, open source Impala project combines modern, scalable parallel database technology with the power of Hadoop, enabling users to directly query data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is designed from the ground up as part of the Hadoop ecosystem and shares the same flexible file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack.


a. Performance equivalent to leading MPP databases, and 10-100x faster than Apache Hive/Stinger.
b. Faster time-to-insight than traditional databases by performing interactive analytics directly on data stored in Hadoop without data movement or predefined schemas.
c. Cost savings through reduced data movement, modeling, and storage.
d. More complete analysis of full raw and historical data, without information loss from aggregations or conforming to fixed schemas.
e. Familiarity of existing business intelligence tools and SQL skills to reduce barriers to adoption.
f. Security with Kerberos authentication, and role-based authorization through the Apache Sentry project.
g. Freedom from vendor lock-in through the open source Apache license.

20. MongoDB

A MongoDB deployment hosts a number of databases. A database holds a set of collections. A collection holds a set of documents. A document is a set of key-value pairs. Documents have a dynamic schema, which means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.

MongoDB Features

a. Flexibility: MongoDB stores data in JSON documents (which are serialized to BSON). JSON provides a rich data model that seamlessly maps to native programming language types, and the dynamic schema makes it easier to evolve your data model than with a system with enforced schemas such as an RDBMS.
b. Power: MongoDB provides a lot of the features of a traditional RDBMS such as secondary indexes, dynamic queries, sorting, rich updates, upserts (update if document exists, insert if it doesn’t), and easy aggregation. This gives you the breadth of functionality that you are used to from an RDBMS, with the flexibility and scaling capability that the non-relational model allows.
c. Speed/Scaling: By keeping related data together in documents, queries can be much faster than in a relational database where related data is separated into multiple tables and then needs to be joined later. MongoDB also makes it easy to scale out your database. Autosharding allows you to scale your cluster linearly by adding more machines. It is possible to increase capacity without any downtime, which is very important on the web when load can increase suddenly and bringing down the website for extended maintenance can cost your business large amounts of revenue.
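
To make "dynamic schema" and "upsert" less abstract, here is a tiny Python sketch that uses a plain list of dictionaries as a stand-in for a collection. It is not the MongoDB driver API. The `upsert` helper and the sample documents are hypothetical, and matching is limited to simple field equality.

```python
def upsert(collection, query, changes):
    """Update the first document matching query, or insert a new one.

    collection is a list of dicts standing in for a MongoDB collection.
    """
    for document in collection:
        if all(document.get(field) == value for field, value in query.items()):
            document.update(changes)  # document exists: update it in place
            return "updated"
    collection.append({**query, **changes})  # no match: insert a new document
    return "inserted"


users = [{"name": "ann", "age": 30}]

upsert(users, {"name": "ann"}, {"age": 31})          # updates ann
upsert(users, {"name": "bob"}, {"tags": ["admin"]})  # inserts bob

# Dynamic schema: the two documents do not share the same fields.
print(users)
# [{'name': 'ann', 'age': 31}, {'name': 'bob', 'tags': ['admin']}]
```

Nothing had to be declared up front for `bob` to carry a `tags` field that `ann` lacks, which is the flexibility the feature list describes.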