How to configure Spark to maximize resource usage while using AWS EMR
When we run a Spark cluster on EMR, we often create a new cluster for every Spark job. In that case, we want to use all available resources, but changing the configuration is annoying and error-prone. How many times have you forgotten to change the Spark settings after changing the EMR instances to less powerful ones? Of course, the Spark job failed because it could not allocate the resources you wanted.
Even worse, if you forget to change the settings after switching to a bigger instance, you pay for a more powerful cluster but do not use all of it.
Fortunately, the EMR distribution of Spark has a special configuration parameter that replaces all of the cumbersome parameters, such as the executor memory, the executor cores, or parallelism.
Instead of setting them one by one, we should enable the maximizeResourceAllocation feature:
--conf maximizeResourceAllocation=true
when we call the
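The AWS EMR documentation describes maximizeResourceAllocation as a property of the `spark` configuration classification, supplied when the cluster is created. The snippet below is a minimal sketch of that form, not a definitive setup: the helper name `spark_max_allocation_config` is hypothetical, and the code only builds the configuration structure without creating a cluster.

```python
import json


def spark_max_allocation_config(enabled: bool = True) -> list:
    """Build an EMR configuration list that toggles maximizeResourceAllocation.

    Hypothetical helper for illustration. It returns the structure EMR
    expects for the "spark" configuration classification.
    """
    return [
        {
            "Classification": "spark",
            "Properties": {
                # EMR configuration property values are strings, not booleans.
                "maximizeResourceAllocation": "true" if enabled else "false",
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON form, e.g. to save it to a file for the AWS CLI.
    print(json.dumps(spark_max_allocation_config(), indent=2))
```

The resulting list can be passed as the `Configurations` argument of boto3's `run_job_flow`, or saved as JSON and given to `aws emr create-cluster` through the `--configurations` option. Because EMR derives the executor settings from the instance types in the cluster, the same configuration keeps working when the instance type changes.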