Some notes based on two videos describing Apache Spark concepts.

Spark SQL: A Compiler from Queries to RDDs: Spark Summit East talk by Sameer Agarwal

Top 5 Mistakes When Writing Spark Applications

Mistake 1 - wrong resource allocation. Example of the resource calculation (number of executors, cores, and memory for each executor) for a cluster of 6 nodes, each with 16 cores and 64GB of memory.

We need to leave one core per node for YARN/OS daemons, so there are 15 cores left per node. Then, to maximize HDFS throughput, at most 5 cores per executor should be used, which gives 3 executors per node.

Then we need to leave 1GB of memory per node for the OS, so 63GB is available. About 7% of the allocated memory goes to YARN overhead, so the final calculation is: (64 - 1) / 3 = 21GB per executor, 21 * (1 - 0.07) ≈ 19GB of heap. That gives 6 * 3 = 18 executors in total, minus 1 for the YARN Application Master => 17 executors.
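A worked version of the same arithmetic as a small Scala sketch (all numbers come from the example above):

```scala
// Worked version of the calculation above, using the numbers from the example.
val nodes = 6
val coresPerNode = 16
val memPerNodeGb = 64

val usableCores = coresPerNode - 1                     // 1 core reserved per node for YARN/OS daemons
val coresPerExecutor = 5                               // at most ~5 cores for good HDFS throughput
val executorsPerNode = usableCores / coresPerExecutor  // 15 / 5 = 3

val usableMemGb = memPerNodeGb - 1                     // 1GB reserved per node for the OS
val memPerExecutorGb = usableMemGb / executorsPerNode  // 63 / 3 = 21
val executorHeapGb = (memPerExecutorGb * (1 - 0.07)).toInt // ~7% goes to YARN overhead => 19

val numExecutors = nodes * executorsPerNode - 1        // 18 total, minus 1 for the YARN AM => 17
```

These values correspond to the spark-submit flags --num-executors 17 --executor-cores 5 --executor-memory 19G.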

No - dynamic allocation does not help here. It only solves the num-executors problem, allowing Spark to release executors when they're not used (e.g. a user left the Spark shell open), but the number of cores and the memory per executor must still be specified manually.
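A minimal sketch of that setup (the configuration keys are real Spark properties; the concrete values are just the ones from the example above):

```scala
import org.apache.spark.sql.SparkSession

// Dynamic allocation scales the executor *count* between min and max,
// but cores and memory per executor are still fixed by the user.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // external shuffle service, required on YARN
  .config("spark.dynamicAllocation.minExecutors", "0")
  .config("spark.dynamicAllocation.maxExecutors", "17")
  .config("spark.executor.cores", "5")     // still set manually
  .config("spark.executor.memory", "19g")  // still set manually
  .getOrCreate()
```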

Mistake 2 - shuffle blocks larger than 2GB:

https://youtu.be/vfiJQ7wg81Y?t=597: "Size exceeds Integer.MAX_VALUE" - no shuffle block can be greater than 2GB.

There's a good description there of what a shuffle block is.

A related JIRA issue is mentioned in the presentation at the same timestamp (https://youtu.be/vfiJQ7wg81Y?t=597).

Rule of thumb - aim for ~128MB per partition, which keeps the size of each shuffle block well under the limit (see the sketch below).
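A minimal sketch of applying that rule of thumb (the total shuffle size is an assumed placeholder; spark.sql.shuffle.partitions is the real configuration key for DataFrame/SQL shuffles):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sizing").getOrCreate()

// Assumed: roughly 500GB of data participates in the shuffle.
val totalShuffleBytes = 500L * 1024 * 1024 * 1024
val targetPartitionBytes = 128L * 1024 * 1024 // the 128MB rule of thumb

val numPartitions = math.max(1, (totalShuffleBytes / targetPartitionBytes).toInt) // = 4000

// For DataFrame/SQL shuffles:
spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)

// Or repartition an existing DataFrame df explicitly:
// val repartitioned = df.repartition(numPartitions)
```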

See also: https://youtu.be/vfiJQ7wg81Y?t=383

Mistake 3 - skewed data

Add salting to the key - key + random.nextInt(saltFactor) - so that a single hot key is spread across multiple partitions (see the sketch below).
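A minimal salting sketch in Spark SQL (the column names, saltFactor value, and toy input are assumptions for illustration; the two-phase aggregation is the standard pattern):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("salting-example").getOrCreate()
import spark.implicits._

// Assumed toy input with one heavily skewed key.
val df = Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)).toDF("key", "value")

val saltFactor = 10

// Phase 1: append a random salt so the hot key is split across
// up to saltFactor shuffle partitions, then pre-aggregate.
val partial = df
  .withColumn("salt", (rand() * saltFactor).cast("int"))
  .groupBy(col("key"), col("salt"))
  .agg(sum("value").as("partialSum"))

// Phase 2: drop the salt and combine the partial results per key.
val result = partial
  .groupBy("key")
  .agg(sum("partialSum").as("total"))

result.show()
```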

Mistake 4 - NoSuchMethodError.

Occurs when there's a version mismatch between a library shipped with Spark and the one used by the user's application - both end up on the same classpath, and the wrong version wins.

Solution - shading the conflicting dependency (see the sketch below) - https://youtu.be/vfiJQ7wg81Y?t=1397.
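A minimal shading sketch using the sbt-assembly plugin (picking Guava as the conflicting library is an assumption here, though it is a classic offender; ShadeRule.rename and the assemblyShadeRules key are real sbt-assembly API):

```scala
// build.sbt (assumes sbt-assembly is enabled in project/plugins.sbt).
// Relocate Guava classes into a private namespace so the application's
// Guava version cannot clash with the one on Spark's classpath.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.com.google.common.@1").inAll
)
```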