What is the role of RDDs in Apache Spark? – Part 1

This blog introduces Spark’s core abstraction for working with data, the RDD (Resilient Distributed Dataset). An RDD is simply a distributed collection of elements spread across the Spark cluster; the elements can be any Java, Scala, or Python objects, including user-defined classes. In Spark, all work is expressed in one of three ways:

  • Creating new RDDs
  • Transforming existing RDDs
  • Calling operations on RDDs to compute a result

RDD Foundations

An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. We can create RDDs in two ways:

  • By loading an external dataset
  • By distributing a collection of objects from the driver program
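
To picture what "split into multiple partitions" means, here is a minimal sketch in plain Python. It is a toy illustration, not Spark code: the helper split_into_partitions and the sample data are hypothetical names made up for this example, and Spark's own partitioning logic may differ.

```python
# Toy illustration of partitioning a driver-side collection; NOT Spark code.
# split_into_partitions is a hypothetical helper invented for this sketch.
def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, near-equal slices."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

words = ["spark", "rdd", "hadoop", "hdfs", "scala", "python", "java"]
partitions = split_into_partitions(words, 3)

# Each partition can be processed independently of the others;
# on a real cluster, different nodes would handle different partitions.
lengths_per_partition = [[len(w) for w in part] for part in partitions]

print(partitions)
print(lengths_per_partition)
```

Because the slices do not overlap and together cover the whole collection, the work on each slice can proceed in parallel and the results can be combined afterwards.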

Once created, RDDs offer two types of operations:

  • Transformations
  • Actions

Transformations construct a new RDD from a previous one [filter, map, groupBy]. Actions, on the other hand, compute a result based on an RDD and either return it to the driver program or save it to an external storage system such as HDFS, S3, Cassandra, or HBase [first, count, collect, saveAsTextFile].

Transformations and actions differ in the way Spark computes RDDs. We can define new RDDs at any time, but Spark computes them lazily: an RDD is only computed the first time it is used in an action.
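
Lazy evaluation can be seen in miniature with plain Python, whose built-in map() and filter() also return lazy iterators. The sketch below is only an analogy and not Spark code; the names double, calls, and numbers are made up for the example.

```python
# Plain-Python analogy for lazy evaluation; this is NOT Spark code.
calls = []  # records every element the mapping function actually touches

def double(x):
    calls.append(x)
    return x * 2

numbers = [1, 2, 3, 4]

# "Transformations": map() and filter() only build lazy iterators here.
doubled = map(double, numbers)
large = filter(lambda x: x > 4, doubled)
calls_before_action = list(calls)  # snapshot taken before anything is consumed

# "Action": consuming the pipeline forces the computation to run.
result = list(large)
calls_after_action = list(calls)

print(calls_before_action)  # []
print(result)               # [6, 8]
print(calls_after_action)   # [1, 2, 3, 4]
```

Defining the pipeline costs nothing; the function double only runs once the result is requested, which mirrors how an action triggers the computation of an RDD.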

Finally, RDDs are by default recomputed each time we run an action on them. If we want to reuse an RDD in multiple actions, we can ask Spark to persist it using RDD.persist().
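
The difference between recomputing and persisting can be sketched with a small toy class in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: ToyRDD, load_source, and load_log are hypothetical names, and the class keeps everything in a single process.

```python
# Toy model of recomputation versus persist(); NOT Spark's implementation.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # zero-argument function that rebuilds the data
        self._persisted = False
        self._cached = None

    def map(self, fn):
        # Transformation: returns a new ToyRDD that remembers how to derive itself.
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def persist(self):
        # Only sets a flag; the data is stored the first time an action runs.
        self._persisted = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._persisted:
            self._cached = data
        return data

    def count(self):
        # Action: forces the computation.
        return len(self._materialize())

    def collect(self):
        # Action: forces the computation and returns the elements.
        return list(self._materialize())

load_log = []  # one entry per time the source data is actually read

def load_source():
    load_log.append("load")
    return [1, 2, 3]

# Without persist(): each action re-reads the source.
squares = ToyRDD(load_source).map(lambda x: x * x)
squares.count()
squares.collect()
loads_without_persist = len(load_log)  # 2

# With persist(): the first action stores the result, the second reuses it.
load_log.clear()
kept = ToyRDD(load_source).map(lambda x: x * x).persist()
kept.count()
kept_values = kept.collect()
loads_with_persist = len(load_log)  # 1

print(loads_without_persist, loads_with_persist, kept_values)
```

In the toy model the source is read twice without persist() and once with it, which is the saving that persisting an RDD is meant to provide when the same data feeds several actions.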

RDD Persistence

Spark offers several storage levels for persisting an RDD, including MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, and DISK_ONLY. To replicate the data on two machines, add _2 to the end of the storage level (for example MEMORY_ONLY_2). In practice, we often use persist() to load a subset of the data into memory so that it can be queried frequently. Calling cache() is the same as calling persist() with the default storage level.

To summarize, every Spark program works as follows:

  • Create some input RDDs from external data
  • Transform RDDs to define new RDDs using transformations like filter()
  • Use persist() to persist any intermediate RDDs that will be reused
  • Launch actions such as count() and first() to kick off the parallel computation

To conclude, RDDs are immutable, partitioned collections of objects spread across a cluster, stored in RAM or on disk, built through lazy parallel transformations, and automatically rebuilt on failure. In Part 2, we will share the internal details of transformations and actions on RDDs and the benefits of lazy evaluation.

Reference – Big Data Analytics Community, Learning Spark: Karau, Konwinski, Wendell, Zaharia.

This article originally appeared here. Republished with permission. Submit your copyright complaints here.
