What is the difference between a transformation and an action in Apache Spark?

This article is a part of my "100 data engineering tutorials in 100 days" challenge. (25/100)

The concepts of actions and transformations in Apache Spark are not something we think about every day while using Spark, but occasionally it is good to know the distinction.

First of all, we may be asked that question during a job interview. It is a good question because it helps spot people who did not bother reading the documentation of the tool they are using. Fortunately, after reading this article, you will know the difference.

In addition to that, we may need to know the difference to explain why the code we are reviewing is too slow. For example, some data engineers have strange ideas like calculating counts in the middle of a Spark job to log the number of rows or store them as a metric. While a well-placed count may help to debug and tremendously speed up problem-solving, counting the number of rows in every other line of code is massive overkill. The difference between a transformation and an action helps us explain why doing it is a significant bottleneck.

What is a transformation?

A transformation is every Spark operation that returns a DataFrame, Dataset, or an RDD. When we build a chain of transformations, we add building blocks to the Spark job, but no data gets processed. That is possible because transformations are lazy executed. Spark will calculate the value when it is necessary.

Of course, this also means that Spark needs to recompute the values when we re-use the same transformations. We can avoid that by using the persist or cache functions.

Subscribe to the newsletter and join the free email course.

What is an action?

Actions, on the other hand, are not lazily executed. When we put an action in the code and Spark reaches that line of code when running the job, it will have to perform all of the transformations that lead to that action to produce a value.

Producing value is the key concept here. While transformations return one of the Spark data types, actions return a count of elements (for example, the count function), a list of them (collect, take, etc.), or store the data in external storage (write, saveAsTextFile, and others).

How to tell the difference

When we look at the Spark API, we can easily spot the difference between transformations and actions. If a function returns a DataFrame, Dataset, or RDD, it is a transformation. If it returns anything else or does not return a value at all (or returns Unit in the case of Scala API), it is an action.

Remember to share on social media!
If you like this text, please share it on Facebook/Twitter/LinkedIn/Reddit or other social media.

If you want to contact me, send me a message on LinkedIn or Twitter.

Would you like to have a call and talk? Please schedule a meeting using this link.

Bartosz Mikulski
Bartosz Mikulski * MLOps Engineer / data engineer * conference speaker * co-founder of Software Craft Poznan & Poznan Scala User Group

Subscribe to the newsletter and get access to my free email course on building trustworthy data pipelines.