Row number in Apache Spark window — row_number, rank, and dense_rank
This article is mostly a “note to self” because I don’t want to google that anymore ;)
Which function should we use to rank the rows within a window in an Apache Spark DataFrame?
It depends on the expected output. row_number numbers the rows within each window partition in the order given by orderBy, starting from 1 (human-readable, not zero-based). Every row gets a different number, even when the ordering values are equal, and the order among such tied rows is not guaranteed.
rank and dense_rank both give tied rows the same number. The only difference is what happens after a tie: rank skips as many numbers as there were extra tied rows, so two rows ranked 1 are followed by rank 3, while dense_rank continues with the next number in the sequence, so the same two rows are followed by 2.
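To make the three definitions concrete without starting Spark, here is a minimal plain-Scala sketch. It is not how Spark implements these functions; `RankingSketch` and `number` are hypothetical names made up for this note, and the input is assumed to be a single partition that is already sorted by the ordering column.

```scala
// Plain Scala sketch (no Spark needed): computes the three numberings
// for one partition whose values are already sorted by the ordering column.
object RankingSketch {
  // Hypothetical helper, not part of Spark.
  // Returns (value, rowNumber, rank, denseRank) for every input row.
  def number(sorted: Seq[Int]): Seq[(Int, Int, Int, Int)] = {
    // distinct keeps the first occurrence of each value, so the order is preserved
    val distinctSorted = sorted.distinct
    sorted.zipWithIndex.map { case (value, idx) =>
      val rowNumber = idx + 1                            // position in the partition, starting at 1
      val rank      = sorted.indexOf(value) + 1          // position of the first row with this value
      val denseRank = distinctSorted.indexOf(value) + 1  // number of distinct values up to this one
      (value, rowNumber, rank, denseRank)
    }
  }

  def main(args: Array[String]): Unit =
    number(Seq(10, 10, 20)).foreach(println)
    // prints:
    // (10,1,1,1)
    // (10,2,1,1)
    // (20,3,3,2)
}
```

The last row shows the difference: rank jumps to 3 because two rows came before it, while dense_rank is 2 because only one distinct value came before it.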
I found this great example on StackOverflow that seems to explain everything:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Outside spark-shell, toDF also needs: import spark.implicits._

val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")

// One window per col1 value, rows ordered by col2
val windowSpec = Window.partitionBy("col1").orderBy("col2")

df
  .withColumn("rank", rank().over(windowSpec))
  .withColumn("dense_rank", dense_rank().over(windowSpec))
  .withColumn("row_number", row_number().over(windowSpec))
  .show()
+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
|   a|  10|   1|         1|         1|
|   a|  10|   1|         1|         2|
|   a|  20|   3|         2|         3|
+----+----+----+----------+----------+
Source: https://stackoverflow.com/questions/44968912/difference-in-dense-rank-and-row-number-in-spark
