Apache Spark: should we use RDD, Dataset, or DataFrame?
I remember having a problem deciding which data type I should use in my Spark code. I knew that RDDs are not optimized, but what is the difference between Dataset and DataFrame? Why do we even have both?
In this article, I am going to show you why you should not use RDDs directly, why Datasets are the best, and why you should not bother about DataFrames.
Why shouldn’t we use RDDs?
The biggest problem of RDDs is the fact that Spark can not optimize the code. It is not possible because RDDs provide only low-level API. The code we write does not express the result we want, but instead, we must provide detailed step-by-step instructions.
Such an API is quite useful when we are dealing with unstructured data. In that case, it is usually not a huge problem. After all, Spark will not optimize the processing of audio or image data, and we are not going to do any aggregations on such kind of data.
If we can extract some useful structure from the unstructured data, it is often beneficial to convert the RDDs into Datasets or DataFrames and continue processing using the SQL-like code to get at least some of the optimizations.
How does Spark optimize code?
When we write our Spark code using DataFrames, Datasets, or SQL queries, it gets converted into an Unresolved Logic Plan.
Every DataFrame and Dataset represents a logical table with data. Spark has a catalog of used tables and columns. That information allows it to resolve references between tables and create a Logical Plan of the query.
At this point, the Catalyst optimizer produces the Optimized Logical Plan.
Spark knows which worker nodes have the required partitions of data so it can use the Optimized Logical Plan to produce a Psychical Plan. There are multiple ways of converting the logical plan into a psychical plan, so this operation returns multiple psychical plans.
Spark estimates the cost of every psychical plan and selects the most optimal one.
The Selected Psychical Plan gets transformed into optimal RDD operations. That code gets executed by the worker nodes.
Why are Datasets the best?
Datasets provide two features that are required to produce the most optimal query execution plan:
First, Datasets are defined as SQL-like operations, so we are not telling Spark what is supposed to happen, but instead of that, we declare what the expected outcome is. Therefore, the optimizer can modify the code to get the desired result most optimally.
In addition to that, a Dataset has its schema. The schema defines the column names and their types. The schema allows Spark to not only apply type-specific optimizations but also check whether the query is correct before it gets executed.
Because of that, we are going to get an error message sooner. It is always good to see information about a missing column before Spark starts executing the query, not after waiting for six hours to get the result.
What about DataFrames?
1 type DataFrame = Dataset[Row]
Since Spark 2.0, Datasets and DataFrames have been unified. Now, DataFrames are defined as Datasets of type Row. I chose to pretend that DataFrames don’t exist anymore ;)
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive
- What is the difference between CUBE and ROLLUP and how to use it in Apache Spark?
- How to use the window function to get a single row from each group in Apache Spark
- What is the difference between repartition and coalesce in Apache Spark?
- How Airflow scheduler works
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz