Row number in Apache Spark window — row_number, rank, and dense_rank

This article is mostly a “note to self” because I don’t want to google that anymore ;)

Which function should we use to rank the rows within a window in Apache Spark data frame?

It depends on the expected output. row_number is going to sort the output by the column specified in orderBy function and return the index of the row (human-readable, so starts from 1).

The only difference between rank and dense_rank is the fact that the rank function is going to skip the numbers if there are duplicates assigned to the same rank. In the same situation, the dense_rank function uses the next number in a sequence.

I found this great example on StackOverflow that seems to explain everything:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")

val windowSpec = Window.partitionBy("col1").orderBy("col2")

df
  .withColumn("rank", rank().over(windowSpec))
  .withColumn("dense_rank", dense_rank().over(windowSpec))
  .withColumn("row_number", row_number().over(windowSpec)).show

+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
|   a|  10|   1|         1|         1|
|   a|  10|   1|         1|         2|
|   a|  20|   3|         2|         3|
+----+----+----+----------+----------+

Source: https://stackoverflow.com/questions/44968912/difference-in-dense-rank-and-row-number-in-spark

Did you enjoy reading this article?
Would you like to learn more about leveraging AI to drive growth and innovation, software craft in data engineering, and MLOps?

Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!

Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.

Bartosz Mikulski

Bartosz Mikulski

  • MLOps engineer by day
  • AI and data engineering consultant by night
  • Python and data engineering trainer
  • Conference speaker
  • Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
  • Twitter: @mikulskibartosz
  • Mastodon: @mikulskibartosz@mathstodon.xyz
Newsletter

Do you enjoy reading my articles?
Subscribe to the newsletter if you don't want to miss the new content, business offers, and free training materials.