Building trustworthy data pipelines because AI cannot learn from dirty data
How to derive multiple columns from a single column in a PySpark DataFrame
Extract multiple columns from a single column using the withColumn function and a PySpark UDF
29 Sep 2020
Broadcast variables and broadcast joins in Apache Spark
How to speed up joins of small DataFrames by using the broadcast join
28 Sep 2020
How to use the window function to get a single row from each group in Apache Spark
How to group values by a key and extract a single row from each group in Apache Spark
27 Sep 2020
How to make a pivot table in AWS Athena or PrestoSQL
How to make a pivot table in AWS Athena, and why the pivot function does not exist
26 Sep 2020
What is the difference between repartition and coalesce in Apache Spark?
When should you use coalesce instead of repartition in Apache Spark?
25 Sep 2020
How to pivot an Apache Spark DataFrame
How to turn an Apache Spark or PySpark DataFrame into a pivot table
24 Sep 2020
What is the difference between cache and persist in Apache Spark?
When to use the cache function, and when to use the persist function
23 Sep 2020
Why your company should use PrestoSQL
Should your team use PrestoSQL?
16 Sep 2020
Is counting rows all we can do?
How to detect problems in data pipelines before they turn into hard-to-debug bugs? I wish I knew.
08 Sep 2020
How to Speed Up AWS Athena Queries Using Partition Projection
How to define partition projection while creating an Athena table
30 Aug 2020
How to send a customized Slack notification when an Airflow task fails
How to customize a Slack notification before sending it to the Slack incoming webhook
27 Aug 2020
How to use one SparkSession to run all Pytest tests
How to speed up Pytest tests by reusing the same SparkSession in all of them
20 Jul 2020