Machine Learning

Tools in the data armoury: R vs Spark

The purpose of this article is twofold: first, to give a quick performance comparison between R and Spark, and second, to introduce you to Spark’s ML library.
Background
Because R is inherently single-threaded, comparing Spark and R purely on performance may not be entirely fair. Even so, some of the numbers below will definitely excite anyone who has faced these problems in the past.
Have you ever thrown a machine learning problem at R and waited for hours? You had to be patient simply because you had no viable alternative. It’s time to take a look at Spark ML, which offers most of R’s functionality and is excellent in terms of scaling and performance.
I once took a specific problem and tried to solve it using different machine learning techniques, with both R and Spark ML as tools. To keep the comparison fair, I used the same hardware and operating system for both, and I ran Spark in standalone mode with no cluster configured.
Before we get into the details, a small note about Revolution R. As an enterprise version of R, it attempts to address R’s single-threaded weakness. But getting locked into proprietary software like Revolution Analytics may not be an ideal long-term solution, and Microsoft’s acquisition of Revolution Analytics may complicate things further in terms of licensing.
Hence, community-backed open source tools like Spark will probably be a better option than Revolution R.
Dataset and problem
The dataset taken for this analysis is the Digit Recognizer dataset from Kaggle. It contains gray-scale images of hand-drawn digits, from zero through nine.
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single value associated with it, indicating how light or dark that pixel is, with higher numbers meaning darker; this value is an integer between 0 and 255, inclusive. The dataset has 785 columns: the first column, called “label,” is the digit that was drawn by the user, and the remaining 784 hold the pixel values.
The goal is to come up with a model that can predict the digits from the pixel data.
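To make that layout concrete, here is a minimal sketch of the 785-column shape. The values are random stand-ins, not real Kaggle rows; only the shapes match the description above.

```python
import numpy as np

# Toy stand-in for the Kaggle digit-recognizer data: each row is a label
# followed by 784 gray-scale pixel values in [0, 255].
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=5)          # column 1: "label"
pixels = rng.integers(0, 256, size=(5, 784))  # columns 2..785: pixel values
data = np.column_stack([labels, pixels])      # 5 x 785, like the real file

image = data[0, 1:].reshape(28, 28)           # one row back into a 28x28 image
print(data.shape, image.shape)                # (5, 785) (28, 28)
```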
The rationale behind choosing this dataset is that, in terms of volume, it is not really a big data problem.
Comparison
Here are some of the machine learning techniques/steps that were applied to this problem, each resulting in one or more predictive models:

  • Run PCA and LDA on the dataset to derive principal components (as a feature engineering step).
  • Run a binary logistic regression for all pairs of digits (45 in total) and classify them based on the pixel information and the PCA and LDA features.
  • Run a multinomial logistic regression model on the entire dataset for multi-class classification.
  • Run a Naive Bayes classification model to classify the digits based on the pixel information and the PCA and LDA features.
  • Run a decision tree classification model for the digit classification.

Before these steps, I split the labelled data into training and test sets, using the former to train each model and the latter to validate its accuracy.
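A train/test split like this can be sketched in a few lines. The 80/20 ratio below is an assumption for illustration; the article does not state the exact split used.

```python
import numpy as np

# Shuffle row indices, then slice: an assumed 80/20 train/test split
# on toy data with the dataset's 784 pixel features.
rng = np.random.default_rng(42)
n = 1000
X = rng.random((n, 784))
y = rng.integers(0, 10, size=n)

idx = rng.permutation(n)          # random order of row indices
cut = int(0.8 * n)                # assumed 80% training fraction
X_train, y_train = X[idx[:cut]], y[idx[:cut]]
X_test, y_test = X[idx[cut:]], y[idx[cut:]]
print(X_train.shape, X_test.shape)  # (800, 784) (200, 784)
```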
Most of these steps were carried out in both R and Spark. The details of the comparison are provided below for PCA, the binary logit models, and the Naive Bayes classification models.
Principal component analysis
The main computational complexity of PCA lies in the scoring portion. The logical steps are given below:

  • It learns a K x M matrix of weightages by running through the data and the covariance table of the columns (where K is the number of principal components and M is the number of features in the dataset).
  • Scoring N observations is then a matrix multiplication.
  • Multiplying the N x M dataset by the M x K weightages yields N x K principal components, that is, K principal components for each of the N observations.

In our case, this scoring operation is a matrix multiplication between a 42000 x 784 dataset and a 784 x 9 weightage matrix. When this computation was thrown at R, it honestly took more than four hours. Spark took no more than ten seconds to complete the same operation.
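The scoring step itself is a one-liner once the weightage matrix is in hand. Here it is with the exact dimensions from the experiment; the weightage matrix is random, purely to show the shape of the computation.

```python
import numpy as np

# PCA scoring as a single matrix multiplication, at the article's scale:
# 42,000 observations x 784 pixels, projected onto 9 principal components.
rng = np.random.default_rng(0)
X = rng.random((42000, 784))  # N x M dataset
W = rng.random((784, 9))      # M x K learned weightages (random stand-in)
scores = X @ W                # N x K principal components
print(scores.shape)           # (42000, 9)
```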
[Chart: PCA, Spark vs. R]
This matrix multiplication amounts to close to 300 million multiply-and-add operations, with quite a bit of indexing and lookup on top. It’s amazing that Spark’s parallel computing engine could finish it in ten seconds.
I verified the accuracy of the generated principal components by looking at the variances explained by the top nine. They matched exactly the variances of the top nine principal components generated by R, confirming that Spark’s performance and scaling come at no cost in accuracy.
Logistic regression model
Unlike PCA, in a logistic regression model both training and scoring are extremely computationally intensive. Logistic regression has no closed-form solution; training is typically iterative (for example, iteratively reweighted least squares), and each iteration involves transpose and inverse operations on matrices built from the entire dataset.
Due to this computational complexity, training and scoring in R took a while to complete. Seven hours, to be precise. Spark took only about five minutes.
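To see why every pass is expensive, here is a single gradient-descent update for binary logistic regression (a simpler iterative scheme than IRLS, used here only to illustrate the cost): each iteration touches the full N x M dataset, transpose products included. Sizes are toy values.

```python
import numpy as np

# One gradient-descent iteration of binary logistic regression.
rng = np.random.default_rng(1)
X = rng.random((1000, 784))            # toy N x M dataset
y = rng.integers(0, 2, size=1000)      # binary labels
w = np.zeros(784)                      # initial weights

lr = 0.1
p = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted probabilities (sigmoid)
grad = X.T @ (p - y) / len(y)          # gradient: a full pass over the data
w -= lr * grad                         # parameter update
print(w.shape)                         # (784,)
```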
[Chart: logistic regression, R vs. Spark]
Here, I ran the binary logistic regression model for all forty-five pairs of digits from zero to nine.
The scoring/validation was likewise done for all forty-five pairs on the test data.
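The forty-five comes straight from counting unordered digit pairs, C(10, 2) = 45, and the pairs can be enumerated directly:

```python
from itertools import combinations

# All one-vs-one digit pairs; each pair gets its own binary model.
pairs = list(combinations(range(10), 2))
print(len(pairs))   # 45
print(pairs[:3])    # [(0, 1), (0, 2), (0, 3)]
```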
In parallel, I also ran the multinomial logistic regression model as a multi-class classifier; it took about three minutes to complete. I couldn’t run it in R to provide data points for the comparison.
As in the case of PCA, I verified prediction quality, this time using the AUC values of each of the forty-five models. The AUC values matched between the Spark and R models.
Naive Bayes classifier
Unlike PCA and logistic regression, the Naive Bayes classifier is not computationally intensive. It involves operations like computing the class prior probabilities and deriving the posterior probabilities from the additional data available.
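Both of those operations are cheap, which is why this model trains so quickly. A toy sketch of each, with made-up likelihood numbers purely for illustration:

```python
import numpy as np

# Class priors from label counts.
y = np.array([0, 0, 1, 1, 1])                     # toy labels
classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()                    # P(class)
print(priors)                                     # [0.4 0.6]

# Posterior for one observation via Bayes' rule, given per-class
# likelihoods of its features (values invented for illustration).
likelihoods = np.array([0.02, 0.10])              # P(x | class)
unnorm = priors * likelihoods
posterior = unnorm / unnorm.sum()                 # P(class | x)
print(posterior.round(3))                         # [0.118 0.882]
```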
[Chart: Naive Bayes, Spark vs. R]
As can be seen in the chart above, R took about forty-five minutes to complete, while Spark finished in nine seconds. As in the previous cases, the accuracies matched.
In parallel, I also ran a decision tree using Spark ML. It took about twenty seconds. I couldn’t run it in R at all.
Spark ML: Getting started
Enough of comparisons and build-up; let’s get to Spark ML itself! The best place to start is the official programming guide. However, if you want to try something quickly, or learn by practice, you may have a hard time getting it up and running.
To understand the provided sample code and to experiment with a dataset, you first need to understand the basic structure of, and operations supported on, a Spark RDD. You then need to study the different Spark ML programs and start programming against them. By the time your first Spark ML program runs, your interest may well have diminished.
Here are two resources that will help you avoid this problem and smooth the learning curve:

  • The entire source code, for anyone to play around with, along with the R code used for the comparison: https://github.com/vivekmurugesan/experiments/tree/master/spark-ml
  • The source code for a Docker container that comes with Spark pre-installed, plus the binary (jar) of the above project for an even quicker start: https://hub.docker.com/r/vivekmurugesan/spark-hadoop/ . The container also has Apache Hadoop pre-installed, running in pseudo-distributed mode, which makes it easy to put larger files into HDFS for testing with Spark; creating an RDD instance in Spark by loading a file from HDFS is then straightforward.

Productivity And Accuracy
People use different yardsticks to compare tools like these. To me, precision and productivity should decide the choice.
People often prefer R over Spark ML because of Spark’s steep learning curve. They end up using small samples in R, because R takes a long time to process large ones, and this hurts the entire process.
To me, using a small sample is never the solution, because small samples are rarely representative. Take a small sample and you are compromising on accuracy.
Once you rule out small samples, it boils down to productivity. Solving machine learning problems is always iterative in nature. If each iteration is long, the time to completion adds up; if each iteration is short, you can afford to spend a little more time writing the initial code.
Conclusion
With its wealth of statistical computation packages and visual analysis packages like ggplot2, there is no way R can be done away with. The capabilities it brings to data exploration and statistical summarization are unquestionable.
But when it comes to building models on massive datasets, we should explore tools like Spark ML. Spark also provides an R package, SparkR, which can be used to explore distributed datasets from R.
It is always better to have more tools in your armoury, as you never know what to expect in a war. Hence, I will say that it is time to move past R to Spark ML.