What is the role of RDDs in Apache Spark? – Part 1

05th Jan `16, 05:03 PM in Hadoop

Kumar Chinnakali, Contributor

This blog introduces Spark’s core abstraction for working with data: the RDD (Resilient Distributed Dataset). An RDD is simply a distributed collection of elements or objects (Java, Scala, or Python objects, including user-defined functions) partitioned across the Spark cluster. In Spark, all work is expressed in one of three ways:

  • Creating new RDDs
  • Transforming existing RDDs
  • Calling operations on RDDs to compute a result

RDD Foundations:

An RDD in Spark is simply an immutable distributed collection of objects, split into multiple partitions. We can create RDDs in two ways:

  • By loading an external dataset
  • By distributing a collection of objects in the driver program

Once created, an RDD offers two types of operations:

  • Transformations
  • Actions

Transformations construct a new RDD from a previous one (e.g., filter, map, groupBy). Actions, on the other hand, compute a result based on an RDD and either return it to the driver program or save it to an external storage system such as HDFS, S3, Cassandra, or HBase (e.g., first, count, collect, save).

Transformations and actions differ because of the way Spark computes RDDs: although we can define new RDDs at any time, Spark computes them only lazily, that is, the first time they are used in an action.
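To make the lazy-evaluation idea concrete, here is a toy sketch in plain Python. This is not Spark’s real API; LazyDataset and everything in it are made up for illustration. The point is only the shape of the behavior: transformations merely record steps, and nothing runs until an action such as collect() or count() is called.

```python
# Toy sketch (plain Python, NOT Spark's API): transformations build a
# recipe of steps; computation happens only when an action is called.

class LazyDataset:
    """Hypothetical stand-in for an RDD, for illustration only."""
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []          # recorded transformations, not results

    def map(self, f):                     # transformation: returns a new dataset
        return LazyDataset(self.data, self.steps + [("map", f)])

    def filter(self, p):                  # transformation: returns a new dataset
        return LazyDataset(self.data, self.steps + [("filter", p)])

    def collect(self):                    # action: the recipe runs here
        out = list(self.data)
        for kind, f in self.steps:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):                      # action
        return len(self.collect())

nums = LazyDataset(range(10))
evens = nums.filter(lambda x: x % 2 == 0)   # nothing computed yet
squares = evens.map(lambda x: x * x)        # still nothing computed
print(squares.collect())                    # [0, 4, 16, 36, 64]
```

In real Spark the same deferral happens, but the recorded recipe is a lineage graph that can also be replayed to rebuild lost partitions.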

Finally, RDDs are by default recomputed each time we run an action on them; if we want to reuse an RDD multiple times, we can ask Spark to keep it around by calling RDD.persist().

[Figure: RDD persist storage levels]

The figure lists the storage-level options for persisting an RDD in Spark; if we want to replicate the data on two machines, we add _2 to the end of the storage level (e.g., MEMORY_ONLY_2). In practice we will often use persist() to load a frequently queried subset of the data into memory. And cache() is the same as calling persist() with the default storage level (MEMORY_ONLY).
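The recompute-versus-persist difference can be sketched with another plain-Python toy (again, not Spark’s API; LazyDataset and load() are invented for illustration). Counting how often the underlying load runs shows why persisting a reused RDD matters:

```python
# Toy sketch (plain Python, NOT Spark's API): by default each action
# recomputes the dataset from scratch; persist() keeps the first result.

class LazyDataset:
    def __init__(self, compute):
        self.compute = compute        # zero-argument function producing the data
        self.cached = None
        self.persisted = False

    def persist(self):                # mark this dataset to be kept in memory
        self.persisted = True
        return self

    def materialize(self):
        if not self.persisted:
            return self.compute()     # recomputed on every action
        if self.cached is None:
            self.cached = self.compute()
        return self.cached

    def count(self):                  # action
        return len(self.materialize())

calls = {"n": 0}
def load():                           # stands in for an expensive data load
    calls["n"] += 1
    return list(range(5))

ds = LazyDataset(load)
ds.count(); ds.count()
print(calls["n"])                     # 2: one recomputation per action

ds.persist()
ds.count(); ds.count()
print(calls["n"])                     # 3: computed once more, then served from cache
```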

To summarize, every Spark program works as follows:

  • Create some input RDDs from external data
  • Transform RDDs to define new RDDs using transformations like filter()
  • Use persist() to persist intermediate RDDs that will be reused
  • Launch actions such as count() and first() to kick-start the parallel computation

To conclude: RDDs are immutable, partitioned collections of objects spread across a cluster, stored in RAM or on disk, built through lazy parallel transformations, and automatically rebuilt on failure. In Part 2 we will share the internal details of RDD transformations and actions, and the benefits of lazy evaluation.

References: Big Data Analytics Community; Learning Spark by Karau, Konwinski, Wendell, and Zaharia.


Originally appeared on dataottam.