In a recent conversation with a client's project team, one member shared an internal slide deck used to promote the benefits of big data in general, and Hadoop in particular, to both key management decision makers and the development and implementation groups in IT. One interesting aspect of the presentation was its comparison of Hadoop to earlier computing ecosystems, casting the open source distributed processing framework in the role of “operating system” for a big data environment.
At the time the slide deck was assembled, that characterization was perhaps something of a stretch. The core components of the initial Hadoop release were the Hadoop Distributed File System (HDFS) for storing and managing data and an implementation of the MapReduce programming model. The latter included application programming interfaces, runtime support for processing MapReduce jobs, and an execution environment that provided the infrastructure for allocating resources in Hadoop clusters and then scheduling and monitoring jobs.
While those components acted as proxies for aspects of an operating system, the framework’s processing capabilities were limited by its architecture: the JobTracker resource manager, the application logic, and the data processing layer were all combined in MapReduce.
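For readers less familiar with the MapReduce programming model mentioned above, the following is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical and chosen only for illustration, and everything runs in one Python process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(documents):
    """Hypothetical driver that runs the three steps in sequence, in memory."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big clusters", "data in clusters"]))
```

In this toy version the driver function does the coordinating itself. In the original Hadoop architecture described above, allocating resources and scheduling and monitoring jobs were handled by the MapReduce execution environment, which is the coupling the paragraph refers to.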