Calculating the cumulative sum of a group using Apache Spark
A cumulative sum (or a running total) is a sequence of partial sums of a given sorted dataset. In this article, I will explain how to use Apache Spark to calculate the cumulative sum of values grouped by a column.
Imagine that I have loaded a
data Spark dataset, which contains credit card transactions. The dataset consists of two columns: date (a date column) and amount (float type). I want to group the payments by year and month and calculate the total amount spent during that month.
To create a single grouping column, I concatenate the year to the month:
1 val withMonth = data.withColumn("yearWithMonth", concat(year($"date"), month($"date")))
Now, it is time to define the window used to calculate the cumulative sum. I use the newly created column as my partitioning key:
1 2 val window = Window .partitionBy($"yearWithMonth")
As the second parameter, I specify the order, because I want the payments to be sorted:
1 2 3 val window = Window .partitionBy($"yearWithMonth") .orderBy($"date".asc)
Finally, I use the
rowsBetween function to specify the window range (note that you should NOT use the
rangeBetween function, because it works on the actual values of the rows, not their position. In this case we want to group by position within the partition).
It crates a window that contains values between the first row of the dataset (sorted) and the currently processed row:
1 2 3 4 val window = Window .partitionBy($"yearWithMonth") .orderBy($"date".asc) .rowsBetween(Window.unboundedPreceding, Window.currentRow)
Now, I can use the
sum function with the window to get the cumulative sum:
1 withMonth.withColumn("spentPerMonth", sum($"spent").over(window))
A quick warning. It turned out that I did not want to know the total value of my credit card transactions since the day I opened the bank account. Make sure that you want to see it before you use your credit card data ;)
Did you enjoy reading this article?
Would you like to learn more about software craft in data engineering and MLOps?
Subscribe to the newsletter or add this blog to your RSS reader (does anyone still use them?) to get a notification when I publish a new essay!
You may also like
- Theory of constraints in data engineering
- Data streaming with Apache Kafka - guide for data engineers
- Dependencies between DAGs: How to wait until another DAG finishes in Airflow?
- Three biggest traps to avoid while setting Spark executor memory
- Making your Scrapy spider undetectable by applying basic statistics
- Data/MLOps engineer by day
- DevRel/copywriter by night
- Python and data engineering trainer
- Conference speaker
- Contributed a chapter to the book "97 Things Every Data Engineer Should Know"
- Twitter: @mikulskibartosz