Developers run big data processing with Hadoop in a variety of cloud environments, but Google is trying to stand apart through investments in the storage subsystem. With a new connector, Hadoop can now run directly against Google Cloud Storage instead of the default distributed file system. This results in lower storage costs, less data replication, and a simpler overall process.
Google has played a role in Hadoop ever since unveiling research on the Google File System in 2003 and MapReduce in 2004. Google’s commercial Hadoop service, called Hadoop on Google Cloud Platform, is described as “a set of setup scripts and software libraries optimized for Google infrastructure.” Hadoop jobs typically use storage based on local disks running the Hadoop Distributed File System (HDFS), but the new Google Cloud Storage Connector for Hadoop attempts to simplify that. The connector makes it possible to run Hadoop directly against Google Cloud Storage instead of copying data to an HDFS store. Google Cloud Storage, which is based on Google’s latest storage technology, called Colossus, is a highly available object storage service for storing large amounts of data in US and EU data centers.
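To make the idea concrete, here is a minimal sketch of the kind of Hadoop configuration that points jobs at Cloud Storage rather than HDFS. It is an illustration under assumptions, not Google's setup scripts: the property names `fs.gs.impl` and `fs.gs.project.id` are the ones commonly documented for the connector and should be checked against the version in use, while the project ID, bucket name, and helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Property names as commonly documented for the Cloud Storage connector;
# verify them against the connector version you install.
GCS_PROPERTIES = {
    # Maps the gs:// scheme to the connector's FileSystem implementation.
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    # Hypothetical project ID, used here only for illustration.
    "fs.gs.project.id": "my-example-project",
}


def render_core_site(properties):
    """Render a dict of Hadoop properties as a core-site.xml style fragment."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def gcs_path(bucket, object_name):
    """Build the gs:// URI a job would use in place of an hdfs:// path."""
    return f"gs://{bucket}/{object_name.lstrip('/')}"


core_site_xml = render_core_site(GCS_PROPERTIES)
input_path = gcs_path("my-example-bucket", "/logs/2014/part-00000")  # hypothetical bucket

print(core_site_xml)
print(input_path)
```

With a configuration along these lines, a job's input and output paths would name a bucket through the `gs://` scheme, so the data is read in place rather than first being copied into an HDFS store.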