Some notes based on two videos describing Apache Spark concepts.
https://www.youtube.com/watch?v=AoVmgzontXo - Spark SQL: A Compiler from Queries to RDDs: Spark Summit East talk by Sameer Agarwal https://www.youtube.com/watch?v=vfiJQ7wg81Y - Top 5 Mistakes When Writing Spark Applications. Spark SQL: A Compiler from Queries to RDDs: Spark Summit East talk by Sameer Agarwal https://youtu.be/AoVmgzontXo?t=641 - an example of the optimization done by Catalyst (better to watch the whole video to get better understanding of the whole context).
Spark is delightful for Big Data analysis. It allows using very high-level code to perform a large variety of operations. It also supports SQL, so you don’t need to learn a lot of new stuff to start being productive in Spark (of course assuming that you have some knowledge of SQL).
However, if you want to use Spark more efficiently, you need to learn a lot of concepts, especially about data partitioning, relations between partitions (narrow dependencies vs.
I use blogdown to write my blog posts. It allows me to create a Rmarkdown file, and then execute all the code and format the output. It has great support for R (it’s R native) and Python. Some other languages are also supported, but the functionality is pretty limited. For example, each code chunk is evaluated in a separate session (I’m not sure if it’s the case for all engines, I read about this in https://yihui.
Some time ago I had to move from sparklyr to Scala for better integration with Spark, and easier collaboration with other developers in a team. Interestingly, this conversion was much easier than I thought because Spark’s DataFrame API is somewhat similar to dplyr, there’s groupBy function, agg instead of summarise, and so on. You can also use traditional, old SQL to operate on data frames. Anyway, in this post, I’ll show how to fit very simple LDA (Latent Dirichlet allocation) model, and then extract information about topic’s words.