How to configure Spark to maximize resource usage while using AWS EMR

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (21/100)

When we run a Spark cluster on EMR, we often create a new cluster for every Spark job. In that case, we want to use all available resources, but changing the configuration is annoying and error-prone. How many times have you forgotten to change the Spark settings after changing the EMR instances to less powerful ones? Of course, the Spark job failed because it could not allocate the resources you wanted.

What is even worse, when you forget to change the settings after changing the instance to a bigger one, you pay for a better cluster, but you are not using it entirely.

Fortunately, the Spark distribution on EMR has a special configuration option that replaces the cumbersome parameters, such as the executor memory, the executor cores, or the default parallelism. EMR derives these values from the instance types in the cluster.

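Instead of setting them by hand, we should enable the maximizeResourceAllocation feature. It is an EMR option, not a Spark property, so spark-submit ignores it when we pass it with --conf (only keys that start with "spark." are accepted there). We set it in the spark configuration classification:

[{"Classification": "spark", "Properties": {"maximizeResourceAllocation": "true"}}]

when we create the cluster.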
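
On EMR, maximizeResourceAllocation is documented as a property of the spark configuration classification, which is supplied when the cluster is created. Below is a minimal sketch of how that could look when the cluster is launched from Python with boto3. The helper names, the cluster name, the release label, the instance type, and the default IAM role names are assumptions made for illustration. The code only builds the request and does not call AWS.

```python
import json


def spark_max_allocation_config():
    """Return the EMR configuration that turns on maximizeResourceAllocation."""
    return [
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "true"},
        }
    ]


def build_cluster_request(name, release_label, instance_type, instance_count):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call.

    Hypothetical helper: the role names below are the EMR defaults and may
    differ in your account.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Configurations": spark_max_allocation_config(),
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Terminate the cluster when the job steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Example values only; replace them with your own cluster settings.
request = build_cluster_request("spark-job", "emr-6.15.0", "m5.xlarge", 3)
print(json.dumps(request["Configurations"], indent=2))

# To launch the cluster for real (requires boto3 and AWS credentials):
# boto3.client("emr").run_job_flow(**request)
```

Because the instance type is only a parameter of the request, switching to smaller or larger instances does not require touching any Spark memory or core settings. The same JSON list can also be saved to a file and passed to the AWS CLI with the --configurations option of aws emr create-cluster.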

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group
