When to cache an Apache Spark DataFrame?

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (11/100)

When should we cache a Spark DataFrame? Wouldn’t it be easier to cache everything? Well, no. That would be a terrible idea because when we use the cache function, Spark is going to spend some time serializing and storing the DataFrame even if we don’t reuse it later.

In my opinion, there are three rules and guidelines regarding caching in Apache Spark:

Cache only what is reused

It is crucial to remember that caching a DataFrame that is used only once is a waste of resources and makes no sense. We should never do it.

Make sure that you have all of the columns

If in one statement, we use columns A, B, and C, but the other statement needs columns B, C, D, it makes no sense to cache any of those DataFrames. In this situation, we should cache the superset that contains all of the columns we are going to need (A, B, C, D), so both statements can use the cached data.

# This does not make sense:

firstDf = df.select('A', 'B', 'C').cache()
secondDf = df.select('B', 'C', 'D').cache()
... some operations that use firstDf and secondDf multiple times

Everything that happens before those cache function calls must be calculated twice, and we keep two copies of B, C columns. Instead of that, let’s do it like this:

superSet = df.select('A', 'B', 'C', 'D').cache()

firstDf = superSet.select('A', 'B', 'C')
secondDf = superSet.select('B', 'C', 'D')

Subscribe to the newsletter and join the free email course.

Use unpersist (sometimes)

Usually, instructing Spark to remove a cached DataFrame is overkill and makes as much sense as assigning a null to no longer used local variable in a Java method. However, there is one exception.

Imagine that I have cached three DataFrames:

firstDf = df.something.cache()
secondDf = df.something.cache()
thirdDf = df.something.cache()

Now, I would like to cache more DataFrames, but I know that I no longer need the third DataFrame. I can use unpersist to tell Spark what it can remove from the cache. Otherwise, it uses the least-recently-used method and may remove something I will want to use later. Therefore, by telling Spark what I no longer need, I may avoid waiting until Spark recomputes something removed just because it was a long time since I used it.

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.