Three biggest traps to avoid while setting Spark executor memory

What happens when you set the executor memory of a Spark worker which uses YARN as the cluster resource manager? Does it get exactly the amount of memory you requested? Unfortunately, no.

To make things easy to comprehend, imagine that I have a Spark cluster with a small amount of memory. Let’s say I have 16GB of RAM.

I want to run a Spark job that requires 4GB of memory. What is going to happen? Will YARN allocate 4GB for that Spark job?

spark.executor.memoryOverhead

First, Spark is going to read the spark.executor.memoryOverhead parameter and compute the overhead as a fraction of the requested amount of memory (by default 10%, with a minimum of 384 MB).

Spark will add the overhead to the executor memory and, as a consequence, request 4506 MB of memory.
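As a back-of-the-envelope sketch (not Spark’s actual source code), the request can be reproduced like this, assuming the documented defaults of a 10% factor and a 384 MB floor; the exact rounding inside Spark may differ by a megabyte or so:

```python
import math

# Rough version of the overhead calculation; 0.10 and 384 MB are the
# documented defaults for spark.executor.memoryOverhead.
def yarn_request_mb(executor_memory_mb: int) -> int:
    overhead_mb = max(math.ceil(executor_memory_mb * 0.10), 384)
    return executor_memory_mb + overhead_mb

print(yarn_request_mb(4096))  # 4096 + 410 = 4506 MB
```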

yarn.scheduler.minimum-allocation-mb

The yarn.scheduler.minimum-allocation-mb parameter causes the second problem.

I set the minimum allocation to 4GB, so YARN will always allocate at least 4GB for every Spark job. That is fine, but YARN can assign only multiples of the minimum-allocation-mb value.

If yarn.scheduler.minimum-allocation-mb is set to 4 GB and I have 16GB of available memory, YARN can allocate 4GB, 8GB, 12GB, or 16GB.

So what is going to happen when Spark requests 4506MB of memory? YARN allocates 8GB! That is a huge amount of additional memory!
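The rounding itself is easy to sketch; here I simply round the request up to the nearest multiple of yarn.scheduler.minimum-allocation-mb, which is what happens in this scenario:

```python
import math

# Round the container request up to the nearest multiple of
# yarn.scheduler.minimum-allocation-mb (4 GB in this example).
def yarn_allocation_mb(requested_mb: int, minimum_allocation_mb: int = 4096) -> int:
    return math.ceil(requested_mb / minimum_allocation_mb) * minimum_allocation_mb

print(yarn_allocation_mb(4506))  # 8192 MB - roughly 3.6 GB more than requested
```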

spark.executor.memory

Why is this a problem? Because Spark will never use that additional memory!

The spark.executor.memory parameter determines the heap size of the Java process running the executor and limits its memory usage. I have allocated 8GB of memory, but the executor can access only half of it! I am wasting a lot of RAM!
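To make the scenario concrete, here is a minimal PySpark configuration using the numbers from this article (in practice these values are often passed on the spark-submit command line instead). Even though YARN hands the executor an 8GB container, the JVM heap stays capped at the 4GB set below:

```python
from pyspark.sql import SparkSession

# The 4g value matches the example above; it caps the executor JVM heap,
# so the extra memory in the 8 GB YARN container stays unused by Spark.
spark = (
    SparkSession.builder
    .appName("executor-memory-example")
    .master("yarn")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```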
