9 must-watch big data technologies along with Hadoop!

18th Apr '16, 10:19 AM in Data Science

Kumar Chinnakali, Contributor

Today, emerging big data technology firms focus on helping enterprises build breakthrough software solutions powered by disruptive enterprise software trends such as machine learning and data science, cybersecurity, enterprise IoT, and cloud.

Hadoop is one of the proven technologies in the big data space, but is it the only one? No: many more technologies help solve big data problems. Here is a quick look at the top 9 must-watch big data technologies alongside Apache Hadoop.

  1. Apache Flink
  2. Apache Samza
  3. Google Cloud Dataflow
  4. StreamSets
  5. TensorFlow
  6. Apache NiFi
  7. Druid
  8. LinkedIn WhereHows
  9. Microsoft Cognitive Services

Apache Flink: Apache Flink is a community-driven open source framework for distributed big data analytics, like Apache Hadoop and Apache Spark. Its engine exploits data streaming, in-memory processing, and iteration operators to improve performance. It has its origins in a research project called Stratosphere, the idea for which was conceived in 2008 by Professor Volker Markl of Technische Universität Berlin, Germany. In German, Flink means agile or swift. Apache Flink joined the Apache Incubator in April 2014 and is now a Top-Level Project (TLP) with many contributors across the globe.


Flink draws on both MPP database technology (declarative queries, a query optimizer, parallel in-memory and out-of-core algorithms) and Hadoop MapReduce technology (massive scale-out, user-defined functions, schema on read), and it adds its own flavors (streaming, iterations, dataflow, general APIs). For further reading…

Apache Samza: It was created at LinkedIn to extend the capabilities of Apache Kafka. Its features include a simple API, managed state, fault tolerance, durable messaging, scalability, extensibility, and processor isolation.


Samza code runs as a YARN job. You implement the StreamTask interface, which defines a process() call. A StreamTask runs inside a task instance, which itself runs inside a YARN container. For further reading…
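
To make the StreamTask idea concrete, here is a minimal sketch in plain Python. It is not Samza's API (real Samza tasks implement the StreamTask interface on the JVM); the names `StreamTask`, `WordCountTask`, and `run_task` are hypothetical stand-ins that only mimic the shape of a per-message process() callback with task-local state.

```python
from abc import ABC, abstractmethod
from collections import Counter


class StreamTask(ABC):
    """Hypothetical Python analogue of a per-message stream task."""

    @abstractmethod
    def process(self, message, collector):
        """Handle one incoming message and emit results via the collector."""


class WordCountTask(StreamTask):
    """Keeps a running count per word (a stand-in for managed state)."""

    def __init__(self):
        self.counts = Counter()

    def process(self, message, collector):
        for word in message.split():
            self.counts[word] += 1
            collector.append((word, self.counts[word]))


def run_task(task, messages):
    """Toy driver: the framework, not user code, calls process() per message."""
    out = []
    for message in messages:
        task.process(message, out)
    return out


results = run_task(WordCountTask(), ["big data", "big streams"])
```

The point of the design is that user code only supplies process(); the surrounding container decides when and where it is called.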

Cloud Dataflow: Dataflow is a native Google Cloud data processing service with a simple programming model for both batch and streaming data processing tasks. It provides a managed service that controls the execution of data processing jobs, which can be authored using the Dataflow SDKs (Apache Beam).

Google Cloud Dataflow provides management, monitoring, and security capabilities for data pipelines. Sources and sinks abstract the read and write operations in a pipeline. A pipeline encapsulates an entire series of computations: it accepts input data from external sources, transforms that data, and produces output data. For further reading.
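
The source, transform, and sink idea can be sketched in a few lines of plain Python. This is an illustration of the pipeline concept only, not the Dataflow or Apache Beam SDK; the `Pipeline` class, its `apply` and `run` methods, and the `read_lines` source are all hypothetical names chosen for this example.

```python
class Pipeline:
    """Toy pipeline: read from a source, apply transforms in order, write to a sink."""

    def __init__(self, source):
        self.source = source      # callable that returns the input data
        self.transforms = []      # ordered list of data -> data functions

    def apply(self, fn):
        self.transforms.append(fn)
        return self               # return self so calls can be chained

    def run(self, sink):
        data = self.source()
        for fn in self.transforms:
            data = fn(data)
        sink(data)                # the sink abstracts the write operation


def read_lines():
    """Hypothetical source standing in for an external input."""
    return ["flink samza", "druid", "nifi flink"]


output = []
(
    Pipeline(read_lines)
    .apply(lambda lines: [word for line in lines for word in line.split()])
    .apply(lambda words: sorted(set(words)))
    .run(output.extend)
)
```

Because the pipeline only knows its source and sink as callables, the same transforms could in principle be reused with a different input or output.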

StreamSets: StreamSets is a data processing platform optimized for data in motion. It has a visual data flow authoring model and is distributed as open source. It can be deployed on premises or in the cloud, and it comes with rich monitoring and management interfaces.

Data collectors stream and process data in real time using data pipelines. A pipeline describes a data flow from origin to destination and is composed of origins, processors, and destinations. The lifecycle of a data collector can be controlled via the administration console. For further details.

TensorFlow: It is a second-generation machine learning system, the successor to DistBelief. TensorFlow grew out of a project at Google called Google Brain, aimed at applying various kinds of neural network machine learning to products and services across the company. It is an open source software library for numerical computation using data flow graphs, and it is used in Google projects such as DeepDream, RankBrain, Smart Reply, and many more.

Data flow graphs describe mathematical computation with a directed graph of nodes and edges. Nodes in the graph represent mathematical operations; edges represent the multidimensional data arrays (tensors) communicated between them. Edges describe the input/output relationships between nodes. The flow of tensors through the graph is where TensorFlow gets its name. For further reading.
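
As a rough illustration of the nodes-and-edges idea, the sketch below builds a tiny data flow graph in plain Python. It does not use TensorFlow; the `Node` class and the `constant`, `add`, and `scale` helpers are hypothetical, and one-dimensional lists stand in for tensors.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs feed into it."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)   # incoming edges

    def evaluate(self):
        # Pull values along the incoming edges, then apply this node's operation.
        args = [node.evaluate() for node in self.inputs]
        return self.op(*args)


def constant(value):
    """Node with no inputs that always produces the given value."""
    return Node(lambda: value)


def add(a, b):
    """Node that adds two equal-length vectors element by element."""
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], (a, b))


def scale(a, factor):
    """Node that multiplies every element of a vector by a factor."""
    return Node(lambda x: [factor * p for p in x], (a,))


x = constant([1.0, 2.0, 3.0])
y = constant([10.0, 20.0, 30.0])
graph = scale(add(x, y), 2.0)   # building the graph computes nothing yet
result = graph.evaluate()       # values flow through the graph only here
```

Separating graph construction from evaluation is the key idea: the graph is a description of the computation that can be inspected or scheduled before any data flows through it.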

Druid: Druid was started in 2011. It is designed to power interactive data applications, with multi-tenancy (many concurrent users), scalability (trillions of events per day), sub-second queries, and real-time analysis. Its key features include low-latency ingestion, fast aggregations, arbitrary slice-and-dice capabilities, high availability, and both approximate and exact calculations.


Its architecture comprises real-time nodes, historical nodes, broker nodes, coordinator nodes, and an indexing service, and it is queried with a JSON-based query language. For further reading…
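
To give a feel for the JSON-based query language, the sketch below assembles a timeseries query as a Python dict and serializes it. The field layout follows Druid's native timeseries query as I understand it; the data source `page_views`, the dimension `country`, and the metric `clicks` are hypothetical names, and the helper `build_timeseries_query` is not part of any Druid client.

```python
import json


def build_timeseries_query(data_source, start, end, granularity="hour"):
    """Build a Druid-style native timeseries query as a plain dict."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [f"{start}/{end}"],   # ISO-8601 interval, start/end
        # Hypothetical filter: keep only rows where country == "IN".
        "filter": {"type": "selector", "dimension": "country", "value": "IN"},
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "total_clicks", "fieldName": "clicks"},
        ],
    }


query = build_timeseries_query("page_views", "2016-04-01", "2016-04-02")
payload = json.dumps(query, indent=2)   # JSON body that would be sent to a broker
```

In a real deployment this payload would be posted to a broker node, which fans the query out to the real-time and historical nodes and merges their results.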

Apache NiFi: Apache NiFi is a powerful and reliable system for processing and distributing data. It supports directed graphs of data routing and transformation. It has a web-based user interface for creating, monitoring, and controlling data flows, and it is highly configurable: you can modify a data flow at runtime and dynamically prioritize data. It also provides data provenance, which tracks data through the entire system, and it is easily extensible through the development of custom components.

Apache NiFi is built on a few core concepts: FlowFile, Processor, and Connection. For further reading…
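
The three concepts can be modeled in a few lines of plain Python. This is a conceptual sketch, not NiFi's Java API: the classes `FlowFile`, `Connection`, and `UppercaseProcessor`, and the attribute key `processed.by`, are hypothetical and only mirror the roles the real components play.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """A unit of data moving through the flow: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Queue that links one processor's output to the next processor's input."""

    def __init__(self):
        self.queue = deque()

    def put(self, flowfile):
        self.queue.append(flowfile)

    def poll(self):
        return self.queue.popleft() if self.queue else None


class UppercaseProcessor:
    """Toy processor: uppercases the content and tags the FlowFile."""

    def run(self, inbound, outbound):
        flowfile = inbound.poll()
        while flowfile is not None:
            flowfile.content = flowfile.content.upper()
            flowfile.attributes["processed.by"] = "UppercaseProcessor"
            outbound.put(flowfile)
            flowfile = inbound.poll()


incoming, outgoing = Connection(), Connection()
incoming.put(FlowFile(b"hello nifi", {"filename": "a.txt"}))
UppercaseProcessor().run(incoming, outgoing)
processed = outgoing.poll()
```

Keeping attributes separate from content is what lets a flow route or prioritize data without having to read the payload itself.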

LinkedIn WhereHows: WhereHows answers the questions "where is my data?" and "how did it get there?" by providing an enterprise catalog with metadata search. It provides collaboration, data lineage analysis, and connectivity to many data sources and ETL tools. It also powers LinkedIn's data discovery layer.

It has a web interface for data discovery and an API-enabled back-end server that controls metadata crawling and integration with other systems. For further reading…
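
The "where is my data, how did it get there" questions boil down to a metadata search plus a walk over lineage records. The sketch below shows that idea in plain Python; it is not WhereHows code, and the `LINEAGE` table, the dataset and job names, and the `search` and `trace_upstream` helpers are all hypothetical.

```python
# Hypothetical lineage metadata:
# dataset name -> (job that produced it, list of upstream datasets)
LINEAGE = {
    "reports.daily_sales": ("etl_aggregate", ["warehouse.orders"]),
    "warehouse.orders": ("etl_load", ["raw.orders_log"]),
    "raw.orders_log": (None, []),   # a source dataset with no producing job
}


def search(keyword, lineage):
    """'Where is my data?': find catalog entries whose name contains the keyword."""
    return sorted(name for name in lineage if keyword in name)


def trace_upstream(dataset, lineage):
    """'How did it get there?': walk back from a dataset to its sources."""
    path = []
    stack = [dataset]
    seen = set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue                # guard against cycles in the metadata
        seen.add(name)
        job, parents = lineage.get(name, (None, []))
        path.append((name, job))
        stack.extend(parents)
    return path


matches = search("orders", LINEAGE)
history = trace_upstream("reports.daily_sales", LINEAGE)
```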

Microsoft Cognitive Services: It is based on Project Oxford and Bing and offers 22 cognitive computing APIs. The main categories are Vision, Speech, Language, Knowledge, and Search. It is integrated with the Cortana Intelligence Suite.

Its 22 REST APIs abstract cognitive capabilities, and developers can use the SDKs, which are available as open source, for Windows, iOS, Android, and Python. For further details…

To conclude, the big data ecosystem is constantly evolving. Many relevant new technologies are available beyond the traditional Hadoop and Spark stacks, and big internet companies and big data startups are leading innovation in this space.

Originally appeared on DataOttam.