Apache Spark: should we use RDD, Dataset, or DataFrame?
I remember struggling to decide which of these abstractions I should use in my Spark code. I knew that RDDs are not optimized, but what is the difference between a Dataset and a DataFrame? Why do we even have both?
In this article, I am going to show you why you should not use RDDs directly, why Datasets are the best choice, and why you don't need to worry about DataFrames at all.
Why shouldn’t we use RDDs?
The biggest problem with RDDs is that Spark cannot optimize the code. Optimization is impossible because RDDs expose only a low-level API: the code we write does not express the result we want; instead, we must provide detailed, step-by-step instructions.
Such an API is quite useful when we are dealing with unstructured data, and in that case, the lack of optimization is usually not a huge problem. After all, Spark cannot optimize the processing of audio or image data anyway, and we are not going to run aggregations on that kind of data.
If we can extract some useful structure from the unstructured data, it is often beneficial to convert the RDDs into Datasets or DataFrames and continue processing using the SQL-like code to get at least some of the optimizations.
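To see the difference between step-by-step instructions and a declarative query, here is a minimal word-count sketch (a local `SparkSession` and made-up input lines are my own assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-vs-dataframe")
  .getOrCreate()
import spark.implicits._

// RDD version: explicit, step-by-step instructions that Spark executes as written.
val rddCounts = spark.sparkContext
  .parallelize(Seq("a b", "b c", "b"))
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
  .toMap

// DataFrame version: declares the expected result; the optimizer decides how to compute it.
val dfCounts = Seq("a b", "b c", "b").toDF("line")
  .select(explode(split($"line", " ")).as("word"))
  .groupBy("word")
  .count()
  .collect()
  .map(row => (row.getString(0), row.getLong(1)))
  .toMap
```

Both produce the same counts, but only the second version gives Spark room to rearrange the work.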
How does Spark optimize code?
When we write our Spark code using DataFrames, Datasets, or SQL queries, it first gets converted into an Unresolved Logical Plan.
Every DataFrame and Dataset represents a logical table with data. Spark keeps a catalog of the tables and columns in use, and that information allows it to resolve the references between tables and turn the unresolved plan into a Logical Plan of the query.
At this point, the Catalyst optimizer produces the Optimized Logical Plan.
Spark knows which worker nodes hold the required partitions of data, so it can use the Optimized Logical Plan to produce a Physical Plan. There are multiple ways of converting a logical plan into a physical one, so this step generates several candidate Physical Plans.
Spark estimates the cost of every Physical Plan and selects the cheapest one.
The Selected Physical Plan then gets compiled into RDD operations, and that code is executed by the worker nodes.
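We can inspect all of these stages ourselves. In this minimal sketch (a local `SparkSession` and toy data are my own assumptions), calling `explain(true)` on a query prints the parsed (unresolved) logical plan, the analyzed logical plan, the optimized logical plan, and the selected physical plan:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("plan-stages")
  .getOrCreate()
import spark.implicits._

val query = Seq((1, "a"), (2, "b"), (3, "b")).toDF("id", "label")
  .filter($"id" > 1)
  .groupBy("label")
  .count()

// Prints: Parsed Logical Plan, Analyzed Logical Plan,
// Optimized Logical Plan, and the chosen Physical Plan.
query.explain(true)
```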
Why are Datasets the best?
Datasets provide two features that Spark needs to produce an optimal query execution plan:
First, Dataset operations are SQL-like, so we are not telling Spark step by step what is supposed to happen; instead, we declare what the expected outcome is. Therefore, the optimizer is free to rewrite the code into the most efficient way of producing the desired result.
In addition to that, every Dataset has a schema. The schema defines the column names and their types, which allows Spark not only to apply type-specific optimizations but also to check whether the query is correct before it gets executed.
Because of that, we get the error message sooner. It is always better to see information about a missing column before Spark starts executing the query than after waiting six hours for the result.
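For example, a query referencing a non-existent column fails during analysis, before any executor starts processing data. A minimal sketch (the local `SparkSession`, toy data, and misspelled column name are my own assumptions):

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("early-errors")
  .getOrCreate()
import spark.implicits._

val users = Seq((1L, "Ada"), (2L, "Alan")).toDF("id", "name")

// "nam" is deliberately wrong. Spark validates the query against the schema
// at analysis time, so the mistake surfaces immediately instead of mid-job.
val failedEarly =
  try {
    users.select($"nam").collect()
    false
  } catch {
    case _: AnalysisException => true
  }
```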
What about DataFrames?
type DataFrame = Dataset[Row]
Since Spark 2.0, Datasets and DataFrames have been unified: a DataFrame is now defined as a Dataset of Row objects. I chose to pretend that DataFrames don't exist anymore ;)
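In practice, this means we can move between the two representations for free. A minimal sketch (the local `SparkSession` and the `User` case class are my own assumptions):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("df-is-a-dataset")
  .getOrCreate()
import spark.implicits._

case class User(id: Long, name: String)

// A DataFrame is just a Dataset[Row]: rows are untyped, columns are looked up by name.
val df: DataFrame = Seq((1L, "Ada"), (2L, "Alan")).toDF("id", "name")

// .as[...] reinterprets it as a typed Dataset; the schema must match the case class.
val ds: Dataset[User] = df.as[User]
```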
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz