Building trustworthy data pipelines because AI cannot learn from dirty data
What is the difference between CUBE and ROLLUP, and how do you use them in Apache Spark?
How to use the cube and rollup functions in Apache Spark or PySpark, and what the difference between a cube and a rollup is.
01 Oct 2020
How to concatenate columns in a PySpark DataFrame
How to use the concat and concat_ws functions to merge multiple columns into one in PySpark
30 Sep 2020
How to derive multiple columns from a single column in a PySpark DataFrame
Extract multiple columns from a single column using the withColumn function and a PySpark UDF
29 Sep 2020
Broadcast variables and broadcast joins in Apache Spark
How to speed up joins with small DataFrames by using a broadcast join
28 Sep 2020
How to use the window function to get a single row from each group in Apache Spark
How to group values by a key and extract a single row from each group in Apache Spark
27 Sep 2020
How to make a pivot table in AWS Athena or PrestoSQL
How to make a pivot table in AWS Athena, and why the pivot function does not exist
26 Sep 2020
What is the difference between repartition and coalesce in Apache Spark?
When should you use coalesce instead of repartition in Apache Spark?
25 Sep 2020
How to pivot an Apache Spark DataFrame
How to turn an Apache Spark or PySpark DataFrame into a pivot table.
24 Sep 2020
What is the difference between cache and persist in Apache Spark?
When you should use the cache function, and when you should use the persist function
23 Sep 2020
Why your company should use PrestoSQL
Should your team use PrestoSQL?
16 Sep 2020
Is counting rows all we can do?
How to detect problems in data pipelines before they turn into hard-to-debug bugs? I wish I knew.
08 Sep 2020
How to Speed Up AWS Athena Queries Using Partition Projection
How to define partition projection while creating an Athena table
30 Aug 2020