Three biggest traps to avoid while setting Spark executor memory

What happens when you set the executor memory of a Spark worker which uses YARN as the cluster resource manager? Does it get exactly the amount of memory you requested? Unfortunately, no.

To make things easy to comprehend, imagine that I have a Spark cluster with a small amount of memory. Let’s say I have 16GB of RAM.

I want to run a Spark job that requires 4GB of memory. What is going to happen? Will YARN allocate 4GB for that Spark job?

spark.executor.memoryOverhead

First, Spark is going to read the spark.executor.memoryOverhead parameter and compute the overhead as a fraction of the requested amount of memory (by default 10%, with a minimum of 384 MB).

Spark will add the overhead to the executor memory and, as a consequence, request 4506 MB of memory.
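As a back-of-the-envelope sketch (not Spark’s actual source code), the request can be reproduced like this, assuming the documented defaults of a 10% factor and a 384 MB floor; the exact rounding inside Spark may differ by a megabyte or so:

```python
import math

# Rough version of the overhead calculation; 0.10 and 384 MB are the
# documented defaults for spark.executor.memoryOverhead.
def yarn_request_mb(executor_memory_mb: int) -> int:
    overhead_mb = max(math.ceil(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead_mb

print(yarn_request_mb(4096))  # 4096 + 410 = 4506 MB
```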

yarn.scheduler.minimum-allocation-mb

The yarn.scheduler.minimum-allocation-mb parameter causes the second problem.

I set the minimum allocation to 4GB, so YARN will always allocate at least 4GB for every Spark job. That is fine, but YARN can assign only multiples of the minimum-allocation-mb value.

If yarn.scheduler.minimum-allocation-mb is set to 4 GB and I have 16GB of available memory, YARN can allocate 4GB, 8GB, 12GB, or 16GB.

So what is going to happen when Spark requests 4506MB of memory? YARN allocates 8GB! That is a huge amount of additional memory!
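The rounding itself is easy to sketch; here I simply round the request up to the nearest multiple of yarn.scheduler.minimum-allocation-mb, which is what happens in this scenario:

```python
import math

# Round the container request up to the nearest multiple of
# yarn.scheduler.minimum-allocation-mb (4 GB in this example).
def yarn_allocation_mb(requested_mb: int, minimum_allocation_mb: int = 4096) -> int:
    return math.ceil(requested_mb / minimum_allocation_mb) * minimum_allocation_mb

print(yarn_allocation_mb(4506))  # 8192 MB - roughly 3.6 GB more than requested
```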

spark.executor.memory

Why is this a problem? Because Spark will never use that additional memory!

The spark.executor.memory parameter determines the heap size of the Java process running the executor and limits its memory usage. I have allocated 8GB of memory, but the executor can access only half of it! I am wasting a lot of RAM!
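To make the scenario concrete, here is a minimal PySpark configuration using the numbers from this article (in practice these values are often passed on the spark-submit command line instead). Even though YARN hands the executor an 8GB container, the JVM heap stays capped at the 4GB set below:

```python
from pyspark.sql import SparkSession

# The 4g value matches the example above; it caps the executor JVM heap,
# so the extra memory in the 8 GB YARN container stays unused by Spark.
spark = (
    SparkSession.builder
    .appName("executor-memory-example")
    .master("yarn")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```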
