When to cache an Apache Spark DataFrame?

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (11/100)

When should we cache a Spark DataFrame? Wouldn’t it be easier to cache everything? Well, no. That would be a terrible idea: when we mark a DataFrame with the cache function, Spark spends time and memory serializing and storing it the first time an action computes it, even if we never reuse it later.

In my opinion, there are three guidelines for caching in Apache Spark:

Cache only what is reused

It is crucial to remember that caching a DataFrame that is used only once is a waste of resources and makes no sense. We should never do it.
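As a minimal PySpark sketch (the events.parquet path, the column names, and the output paths are hypothetical), caching pays off only when the same DataFrame feeds more than one action:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("/data/events.parquet")  # hypothetical input

# Used by a single action -> caching would only add serialization and storage overhead
events.groupBy("event_date").count().write.parquet("/output/daily_counts.parquet")

# Reused by two actions -> caching avoids running the filter twice
errors = events.filter(F.col("status") == "ERROR").cache()
errors.groupBy("user_id").count().write.parquet("/output/errors_per_user.parquet")
errors.groupBy("service").count().write.parquet("/output/errors_per_service.parquet")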

Make sure that you have all of the columns

If one statement uses columns A, B, and C, but another statement needs columns B, C, and D, it makes no sense to cache either of those DataFrames separately. In this situation, we should cache a superset that contains all of the columns we are going to need (A, B, C, and D), so both statements can use the cached data.

# This does not make sense:

firstDf = df.select('A', 'B', 'C').cache()
secondDf = df.select('B', 'C', 'D').cache()
# ... followed by operations that use firstDf and secondDf multiple times

Everything that happens before those two cache calls must be calculated twice, and we store two copies of columns B and C. Instead, let’s do it like this:

superSet = df.select('A', 'B', 'C', 'D').cache()

firstDf = superSet.select('A', 'B', 'C')
secondDf = superSet.select('B', 'C', 'D')
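
Keep in mind that cache is lazy: Spark materializes the cached superset only when the first action touches it. If you want to pay that cost up front (for example, before timing the downstream steps), you can trigger it explicitly; a minimal sketch:

superSet.count()  # optional: an action that forces Spark to compute and store the cached data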

Use unpersist (sometimes)

Usually, instructing Spark to remove a cached DataFrame is overkill and makes about as much sense as assigning null to a no-longer-used local variable in a Java method. However, there is one exception.

Imagine that I have cached three DataFrames:

firstDf = df.something.cache()
secondDf = df.something.cache()
thirdDf = df.something.cache()

Now, I would like to cache more DataFrames, but I know that I no longer need the third one. I can call unpersist to tell Spark what it may remove from the cache. Otherwise, Spark falls back on its least-recently-used eviction policy and may remove something I will need later. By telling Spark what I no longer need, I avoid waiting while it recomputes data that was evicted just because I had not used it for a while.
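
Continuing the placeholder example above (df.something stands in for any chain of transformations), the cleanup is a single call; a minimal sketch:

# Free the cached copy of thirdDf before caching more DataFrames
thirdDf.unpersist()

# The new cached DataFrame no longer competes for memory with data I still need
fourthDf = df.something.cache()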
