When to cache an Apache Spark DataFrame?
When should we cache a Spark DataFrame? Wouldn’t it be easier to cache everything? Well, no. That would be a terrible idea because when we use the
cache function, Spark is going to spend some time serializing and storing the DataFrame even if we don’t reuse it later.
In my opinion, there are three rules and guidelines regarding caching in Apache Spark:
Cache only what is reused
It is crucial to remember that caching a DataFrame that is used only once is a waste of resources and makes no sense. We should never do it.
Make sure that you have all of the columns
If in one statement, we use columns A, B, and C, but the other statement needs columns B, C, D, it makes no sense to cache any of those DataFrames. In this situation, we should cache the superset that contains all of the columns we are going to need (A, B, C, D), so both statements can use the cached data.
1 2 3 4 5 # This does not make sense: firstDf = df.select('A', 'B', 'C').cache() secondDf = df.select('B', 'C', 'D').cache() ... some operations that use firstDf and secondDf multiple times
Everything that happens before those cache function calls must be calculated twice, and we keep two copies of B, C columns. Instead of that, let’s do it like this:
1 2 3 4 superSet = df.select('A', 'B', 'C', 'D').cache() firstDf = superSet.select('A', 'B', 'C') secondDf = superSet.select('B', 'C', 'D')
Use unpersist (sometimes)
Usually, instructing Spark to remove a cached DataFrame is overkill and makes as much sense as assigning a null to no longer used local variable in a Java method. However, there is one exception.
Imagine that I have cached three DataFrames:
1 2 3 firstDf = df.something.cache() secondDf = df.something.cache() thirdDf = df.something.cache()
Now, I would like to cache more DataFrames, but I know that I no longer need the third DataFrame. I can use
unpersist to tell Spark what it can remove from the cache. Otherwise, it uses the least-recently-used method and may remove something I will want to use later. Therefore, by telling Spark what I no longer need, I may avoid waiting until Spark recomputes something removed just because it was a long time since I used it.
You may also like
- What is the difference between cache and persist in Apache Spark?
- PySpark-Check - data quality validation for PySpark 3.0.0
- How to combine two DataFrames with no common columns in Apache Spark
- What is the difference between CUBE and ROLLUP and how to use it in Apache Spark?
- How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive