Building trustworthy data pipelines because AI cannot learn from dirty data
How to pivot an Apache Spark DataFrame
How to turn an Apache Spark or PySpark DataFrame into a pivot table.
24 Sep 2020
What is the difference between cache and persist in Apache Spark?
When should you use the cache function, and when should you use persist?
23 Sep 2020
Why your company should use PrestoSQL
Should your team use PrestoSQL?
16 Sep 2020
Is counting rows all we can do?
How do we detect problems in data pipelines before they turn into hard-to-debug bugs? I wish I knew.
08 Sep 2020
How to Speed Up AWS Athena Queries Using Partition Projection
How to define partition projection while creating an Athena table
30 Aug 2020
How to send a customized Slack notification when an Airflow task fails
How to customize a Slack notification before sending it to the Slack incoming webhook.
27 Aug 2020
How to use one SparkSession to run all Pytest tests
How to speed up Pytest tests by reusing the same SparkSession in all of them
20 Jul 2020
How to send AWS CloudWatch Alerts to a Slack channel using Terraform
How to use Terraform to configure a CloudWatch alert and send the message to a Slack channel.
13 Jul 2020
Check-Engine - data quality validation for PySpark 3.0.0
Last week, I was testing whether we can use AWS Deequ for data quality validation. I ran into a few problems. First of all, it was using an outdated version...
06 Jul 2020
Measuring data quality using AWS Deequ
How to measure data quality in Athena tables using AWS Deequ running on an EMR cluster.
29 Jun 2020
How to conditionally skip tasks in an Airflow DAG
How to use XCom and PythonSensor to skip remaining tasks in an Airflow DAG.
22 Jun 2020
The problem with software testing in data engineering
What if we found a bug in our data pipelines? What if that bug were easy to fix, but it would require a lot of time spent backfilling the data?...
15 Jun 2020