If you are looking for some of the most influential research papers that revolutionised the way how we gather, aggregate, analyze and store increasing volumes of data in a short span of 10 years, you are in the right place! These papers were shortlisted, based on recommendations by big data enthusiasts and experts around the globe from various social media channels. In case we’ve missed out any important paper, please let us know.
This paper presents MapReduce, a programming model and its implementation for large-scale distributed clusters. The main idea is to have a general execution model for codes that need to process a large amount of data over hundreds of machines.
It presents Google File System, a scalable distributed file system for large distributed data-intensive applications, which provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
Chubby is a distributed lock service; it does a lot of the hard parts of building distributed systems and provides its users with a familiar interface (writing files, taking a lock, file permissions). The paper describes it, focusing on the API rather than the implementation details.
This paper describes the design and initial implementation of Chukwa, a data collection system for monitoring and analyzing large distributed systems. Chukwa is built on top of Hadoop, an open source distributed filesystem and MapReduce implementation, and inherits Hadoop’s scalability and robustness.
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.
There are two schools of thought regarding what technology to use for data analysis. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. This paper explores the feasibility of building a hybrid system.
This paper outlines the S4 architecture in detail, describes various applications, including real-life deployments, to show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.
This paper describes the architecture and implementation of Dremel, a scalable, interactive ad-hoc query system for analysis of read-only nested data, and explains how it complements MapReduce-based computing.
Percolator is a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. This indexing system based on incremental processing replaced Google’s batch-based indexing system.
It explains about Spanner, Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and sup-port externally-consistent distributed transactions.
Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data.
This paper summarizes twelve key lessons that machine learning researchers and practitioners have learned, which include pitfalls to avoid, important issues to focus on, and answers to common questions.
This paper describes a method of building a forest of uncorrelated trees using a CART like procedure, combined with randomized node optimization and bagging. In addition, it combines several ingredients, which form the basis of the modern practice of random forests.
The paper focuses on developing a general and exact technique for parallel programming of a large class of machine learning algorithms for multicore processors. The central idea is to allow a future programmer or user to speed up machine learning applications by “throwing more cores” at the problem rather than search for specialized optimizations.
This paper focuses on applications that reuse a working set of data across multiple parallel operations and proposes a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce.
This paper presents Twitter’s production logging infrastructure and its evolution from application-specific logging to a unified “client events” log format, where messages are captured in common, well-formatted, flexible Thrift messages.
F1 is a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases.
This paper summarizes the insights of the Eighteenth Annual Roundtable on Information Technology, which sought to understand the implications of the emergence of “Big Data” and new techniques of inferential analysis.