# Row number in Apache Spark window — row_number, rank, and dense_rank

Which function should we use to rank the rows within a window in Apache Spark data frame?

It depends on the expected output. row_number is going to sort the output by the column specified in orderBy function and return the index of the row (human-readable, so starts from 1).

The only difference between rank and dense_rank is the fact that the rank function is going to skip the numbers if there are duplicates assigned to the same rank. In the same situation, the dense_rank function uses the next number in a sequence.

Are you interested in data engineering?

Check out my other blog https://easydata.engineering

I found this great example on StackOverflow that seems to explain everything:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")

val windowSpec = Window.partitionBy("col1").orderBy("col2")

df
.withColumn("rank", rank().over(windowSpec))
.withColumn("dense_rank", dense_rank().over(windowSpec))
.withColumn("row_number", row_number().over(windowSpec)).show

+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
|   a|  10|   1|         1|         1|
|   a|  10|   1|         1|         2|
|   a|  20|   3|         2|         3|
+----+----+----+----------+----------+


Remember to share on social media!