What is the difference between a transformation and an action in Apache Spark?
We rarely think about the distinction between actions and transformations while using Apache Spark day to day, but sometimes it pays to know it.
First of all, we may be asked that question during a job interview. It is a good question because it helps spot people who did not bother reading the documentation of the tool they are using. Fortunately, after reading this article, you will know the difference.
In addition to that, we may need to know the difference to explain why the code we are reviewing is too slow. For example, some data engineers calculate counts in the middle of a Spark job to log the number of rows or store it as a metric. While a well-placed count can help debug a problem and speed up troubleshooting tremendously, counting the rows in every other line of code is massive overkill. The difference between a transformation and an action explains why doing so creates a significant bottleneck.
What is a transformation?
A transformation is any Spark operation that returns a DataFrame, Dataset, or RDD. When we build a chain of transformations, we add building blocks to the Spark job, but no data gets processed. That is possible because transformations are lazily evaluated: Spark calculates the value only when it becomes necessary.
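The lazy pattern is easier to see in a runnable sketch. The following is a toy model in plain Python, not Spark itself; the class LazyDataset and its methods are invented for illustration. Each "transformation" merely records a step, and nothing runs until a value is demanded.

```python
# Toy model of lazy transformations (not Spark itself).
# Transformations record work to do; an action performs it.
class LazyDataset:
    def __init__(self, data, steps=None):
        self._data = data
        self._steps = steps or []

    def map(self, fn):
        # Transformation: returns a new LazyDataset, processes nothing yet.
        return LazyDataset(self._data, self._steps + [("map", fn)])

    def filter(self, pred):
        # Transformation: also just records the step.
        return LazyDataset(self._data, self._steps + [("filter", pred)])

    def collect(self):
        # Action: only now do the recorded steps actually run.
        result = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


ds = LazyDataset(range(5)).map(lambda x: x * 2).filter(lambda x: x > 4)
# No data has been processed so far; collect() triggers the work.
print(ds.collect())  # [6, 8]
```

Real Spark works the same way at a much larger scale: building the chain is cheap, and the cost is paid only when an action runs.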
Of course, this also means that Spark recomputes the values every time we reuse the same transformations. We can avoid that with the persist or cache functions.
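We can illustrate the recomputation problem with another toy sketch in plain Python (again, not Spark code; the Lazy class and the call counter are invented for this example). The cache method mimics what Spark's cache does conceptually: materialize the result once and reuse it.

```python
# Toy illustration (not Spark): without caching, a lazy pipeline
# recomputes its steps every time a value is requested.
compute_calls = 0


def expensive(x):
    global compute_calls
    compute_calls += 1
    return x * x


class Lazy:
    def __init__(self, data, fn):
        self._data, self._fn, self._cached = data, fn, None

    def collect(self):
        # Recomputes on every call unless a cached copy exists.
        if self._cached is None:
            return [self._fn(x) for x in self._data]
        return self._cached

    def cache(self):
        # Materialize once and reuse, conceptually like cache() in Spark.
        self._cached = [self._fn(x) for x in self._data]
        return self


nums = Lazy([1, 2, 3], expensive)
nums.collect()
nums.collect()
print(compute_calls)  # 6: three elements computed twice

nums.cache()          # compute once more and keep the result
nums.collect()
nums.collect()
print(compute_calls)  # 9: cached, no further recomputation
```

In real Spark the trade-off is the same: caching costs memory (or disk) but saves repeated computation of the whole transformation chain.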
What is an action?
Actions, on the other hand, are not lazily evaluated. When Spark reaches an action while running the job, it has to perform all of the transformations leading to that action to produce a value.
Producing a value is the key concept here. While transformations return one of the Spark data types, actions return a count of elements (for example, the count function), a list of them (take, etc.), or store the data in external storage (saveAsTextFile, and others).
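To see why every action replays the whole chain, here is one more self-contained toy sketch in plain Python (not Spark; the LazySeq class and the call counter are invented for illustration). It shows why sprinkling count calls through a job, as described earlier, is so expensive: each action triggers the full pipeline again.

```python
# Toy sketch (not Spark): every action replays the whole chain of
# transformations, so repeated actions mean repeated work.
calls = 0


def double(x):
    global calls
    calls += 1
    return x * 2


class LazySeq:
    def __init__(self, data, fn):
        self._data, self._fn = data, fn

    def count(self):
        # Action: runs the whole transformation, then returns a number.
        return len([self._fn(x) for x in self._data])

    def collect(self):
        # Action: runs the whole transformation again, returns a list.
        return [self._fn(x) for x in self._data]


seq = LazySeq([1, 2, 3], double)
seq.count()    # triggers the pipeline: 3 calls to double
seq.collect()  # triggers it again: 3 more calls
print(calls)   # 6
```

Note also the return types: count and collect produce a plain number and a plain list, while transformations keep returning the lazy dataset type.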
How to tell the difference
When we look at the Spark API, we can easily spot the difference between transformations and actions. If a function returns an RDD, a DataFrame, or a Dataset, it is a transformation. If it returns anything else or does not return a value at all (or returns Unit in the Scala API), it is an action.
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- How to read multiple Parquet files with different schemas in Apache Spark
- When to cache an Apache Spark DataFrame?
- What is shuffling in Apache Spark, and when does it happen?
- Apache Spark: should we use RDD, Dataset, or DataFrame?
- How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz